Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram,

We have some good guidance at
https://spark.apache.org/contributing.html

HTH!
Denny


On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:

>
>
>
> Hello All,
> Recently joined this community and would like to contribute. Is there a
> guideline or recommendation on tasks that can be picked up by a first timer
> or a starter task?
>
> Tried looking at the Stack Overflow tag: apache-spark, but couldn't find
> any information for first time contributors.
>
> Looking forward to learning and contributing.
>
> Thanks
> Ram
>


Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
>>>>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>>>>> section of "Open Communications", restriction of communication inside ASF
>>>>>> INFRA (mailing list) is more about code and decision-making.
>>>>>>
>>>>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>>>>
>>>>>> It's unavoidable if "users" prefer to use an alternative
>>>>>> communication mechanism rather than the user mailing list. Before Stack
>>>>>> Overflow days, there had been a meaningful number of questions around 
>>>>>> user@.
>>>>>> It's just impossible to let them go back and post to the user mailing 
>>>>>> list.
>>>>>>
>>>>>> We just need to make sure it is not the purpose of employing Slack to
>>>>>> move all discussions about developments, direction of the project, etc
>>>>>> which must happen in dev@/private@. The purpose of Slack thread here
>>>>>> does not seem to aim to serve the purpose.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Good discussions and proposals, all around.
>>>>>>>
>>>>>>> I have used slack in anger on a customer site before. For small and
>>>>>>> medium size groups it is good and affordable. Alternatives have been
>>>>>>> suggested as well so those who like investigative search can agree and 
>>>>>>> come
>>>>>>> up with a freebie one.
>>>>>>> I am inclined to agree with Bjorn that this slack has more social
>>>>>>> dimensions than the mailing list. It is akin to a sports club using
>>>>>>> WhatsApp groups for communication. Remember we were originally looking 
>>>>>>> for
>>>>>>> space for webinars, including Spark on LinkedIn that Denny Lee 
>>>>>>> suggested.
>>>>>>> I think Slack and mailing groups can coexist happily. On a more serious
>>>>>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>>>>>> traffic. Currently we hardly get many mails daily, fewer than 5. So 
>>>>>>> having
>>>>>>> a Slack-type medium may improve members' participation.
>>>>>>>
>>>>>>> so +1 for me as well.
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>> Palantir Technologies Limited
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1.
>>>>>>>>
>>>>>>>> To Shani’s point, there are multiple OSS projects that use the free
>>>>>>>> Slack version - top of mind include Delta, Presto, Flink, Trino, 
>>>>>>>> Datahub,
>>>>>>>> MLflow, etc.
>>>>>>>>
>>>>>>>> On Thu, Mar 30, 2023 at 14:15  wrote:
>>>>>>

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
ng
>>>> list because we didn't set up any rule here yet.
>>>>
>>>> To Xiao. I understand what you mean. That's the reason why I added
>>>> Matei from your side.
>>>> > I did not see an objection from the ASF board.
>>>>
>>>> There is on-going discussion about the communication channels outside
>>>> ASF email which is specifically concerning Slack.
>>>> Please hold on any official action for this topic. We will know how to
>>>> support it seamlessly.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Mar 30, 2023 at 9:21 AM Xiao Li  wrote:
>>>>
>>>>> Hi, Dongjoon,
>>>>>
>>>>> The other communities (e.g., Pinot, Druid, Flink) created their own
>>>>> Slack workspaces last year. I did not see an objection from the ASF board.
>>>>> At the same time, Slack workspaces are very popular and useful in most
>>>>> non-ASF open source communities. TBH, we are kind of late. I think we can
>>>>> do the same in our community?
>>>>>
>>>>> We can follow the guide when the ASF has an official process for ASF
>>>>> archiving. Since our PMC are the owner of the slack workspace, we can make
>>>>> a change based on the policy. WDYT?
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> Dongjoon Hyun  于2023年3月30日周四 09:03写道:
>>>>>
>>>>>> Hi, Xiao and all.
>>>>>>
>>>>>> (cc Matei)
>>>>>>
>>>>>> Please hold on the vote.
>>>>>>
>>>>>> There is a concern expressed by ASF board because recent Slack
>>>>>> activities created an isolated silo outside of ASF mailing list archive.
>>>>>>
>>>>>> We need to establish a way to embrace it back to ASF archive before
>>>>>> starting anything official.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> + @d...@spark.apache.org 
>>>>>>>
>>>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>>>>>> Flink) have created their own dedicated Slack workspaces for faster
>>>>>>> communication. We can do the same in Apache Spark. The Slack workspace 
>>>>>>> will
>>>>>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for 
>>>>>>> the
>>>>>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>>>>>>
>>>>>>>> I created one at slack called pyspark
>>>>>>>>
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>>> Palantir Technologies Limited
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 good idea, I d like to join as well.
>>>>>>>>>
>>>>>>>>> Le mar. 28 mars 2023 à 04:09, Winston Lai 
>>>>>>>>> a écrit :
>>>>>>>>>
>>>>>>>>>> Please let us know when the channel is created. I'd like to join
>>>>>>>>>> :)
>>>>>>>>>>
>>>>>>>>>> Thank You & Best Regards
>>>>>>>>>> Winston Lai
>>>>>>>>>> --
>>>>>>>>>> *From:* Denny Lee 
>>>>>>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>>>>>>> *To:* Hyukjin Kwon 
>>>>>>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>>>>>>> user@spark.apache.org>
>>>>>>>>>> *Subject:* Re: Slack for PySpark users
>>>>>>>>>>
>>>>>>>>>> +1 I think this is a great idea!
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Yeah, actually I think we should better have a slack channel so
>>>>>>>>>> we can easily discuss with users and developers.
>>>>>>>>>>
>>>>>>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>> I really like *Slack *as communication channel for a tech
>>>>>>>>>> community.
>>>>>>>>>> There is a Slack workspace for *delta lake users* (
>>>>>>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>>>>>>> I was wondering if there is something similar for PySpark users.
>>>>>>>>>>
>>>>>>>>>> If not, would there be anything wrong with creating a new
>>>>>>>>>> Slack workspace for PySpark users? (when explicitly mentioning that 
>>>>>>>>>> this is
>>>>>>>>>> *not* officially part of Apache Spark)?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Asma ZGOLLI
>>>>>>>>>
>>>>>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4
>> <https://www.google.com/maps/search/Vestre+Aspehaug+4?entry=gmail=g>,
>> 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea!

On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon  wrote:

> Yeah, actually I think we should better have a slack channel so we can
> easily discuss with users and developers.
>
> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>
>> Hi all,
>> I really like *Slack *as communication channel for a tech community.
>> There is a Slack workspace for *delta lake users* (
>> https://go.delta.io/slack) that I enjoy a lot.
>> I was wondering if there is something similar for PySpark users.
>>
>> If not, would there be anything wrong with creating a new Slack workspace
>> for PySpark users? (when explicitly mentioning that this is *not*
>> officially part of Apache Spark)?
>>
>> Cheers
>> Martin
>>
>


Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
What we can do is get into the habit of compiling the list on LinkedIn but
making sure this list is shared and broadcast here, eh?!

As well, when we broadcast the videos, we can do this using zoom/jitsi/
riverside.fm as well as simulcasting this on LinkedIn. This way you can
view directly on the former without ever logging in with a user ID.

HTH!!

On Wed, Mar 15, 2023 at 4:30 PM Mich Talebzadeh 
wrote:

> Understood, Nitin. It would be wrong to act against one's conviction. I am
> sure we can find a way around providing the contents
>
> Regards
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Mar 2023 at 22:34, Nitin Bhansali 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for your prompt response ... much appreciated. I know how to and
>> can create login IDs on such sites but I had taken a conscious decision some
>> 20 years ago (and I will be going against my principles) not to be on such
>> sites. Hence I had asked whether there is any other way I can join/view a
>> recording of the webinar.
>>
>> Anyways not to worry.
>>
>> Thanks & Regards
>>
>> Nitin.
>>
>>
>> On Wednesday, 15 March 2023 at 20:37:55 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi Nitin,
>>
>> LinkedIn is more of a professional medium.  FYI, I am only a member of
>> LinkedIn, no Facebook, etc. There is no reason for you NOT to create a
>> profile for yourself on LinkedIn :)
>>
>>
>> https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en
>>
>> see you there as well.
>>
>> Best of luck.
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali 
>> wrote:
>>
>> Hello Mich,
>>
>> My apologies ... but I am not on any such social/professional sites.
>> Any other way to access such webinars/classes?
>>
>> Thanks & Regards
>> Nitin.
>>
>> On Wednesday, 15 March 2023 at 18:26:51 GMT, Denny Lee <
>> denny.g@gmail.com> wrote:
>>
>>
>> Thanks Mich for tackling this!  I encourage everyone to add to the list
>> so we can have a comprehensive list of topics, eh?!
>>
>> On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
>> wrote:
>>
>> Hi all,
>>
>> Thanks to @Denny Lee for giving access to
>>
>> https://www.linkedin.com/company/apachespark/
>>
>> and contribution from @asma zgolli 
>>
>> You will see my post at the bottom. Please add anything else on topics to
>> the list as a comment.
>>
>> We will then put them together in an article perhaps. Comments and
>> contributions are welcome.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this!  I encourage everyone to add to the list so
we can have a comprehensive list of topics, eh?!

On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
wrote:

> Hi all,
>
> Thanks to @Denny Lee for giving access to
>
> https://www.linkedin.com/company/apachespark/
>
> and contribution from @asma zgolli 
>
> You will see my post at the bottom. Please add anything else on topics to
> the list as a comment.
>
> We will then put them together in an article perhaps. Comments and
> contributions are welcome.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
> wrote:
>
>> Hi Denny,
>>
>> That Apache Spark Linkedin page
>> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
>> a wider audience to benefit from it.
>>
>> +1 for me
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>>
>>> In the past, we've been using the Apache Spark LinkedIn page
>>> <https://www.linkedin.com/company/apachespark/> and group to broadcast
>>> these type of events - if you're cool with this?  Or we could go through
>>> the process of submitting and updating the current
>>> https://spark.apache.org or request to leverage the original Spark
>>> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>>>  WDYT?
>>>
>>> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well that needs to be created first for this purpose. The appropriate
>>>> name etc. to be decided. Maybe @Denny Lee   can
>>>> facilitate this as he offered his help.
>>>>
>>>>
>>>> cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>>>
>>>>> Hello Mich,
>>>>>
>>>>> Can you please provide the link for the confluence page?
>>>>>
>>>>> Many thanks
>>>>> Asma
>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>
>>>>> Le lun. 13 mars 2023 à 17:21, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> a écrit :
>>>>>
>>>>>> Apologies I missed the list.
>>>>>>
>>>>>> To move forward I selected these topics from the thread "Online
>>>>>> classes for spark topics".
>>>>>>
>>>>>> To take this further I propose a confluence page to be set up.
>>>>>>
>>>>>>
>>>>>>1. Spark UI
>>>>>>2. Dynam

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page
<https://www.linkedin.com/company/apachespark/> and group to broadcast
these type of events - if you're cool with this?  Or we could go through
the process of submitting and updating the current https://spark.apache.org
or request to leverage the original Spark confluence page
<https://cwiki.apache.org/confluence/display/SPARK>. WDYT?

On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
wrote:

> Well that needs to be created first for this purpose. The appropriate name
> etc. to be decided. Maybe @Denny Lee   can
> facilitate this as he offered his help.
>
>
> cheers
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>
>> Hello Mich,
>>
>> Can you please provide the link for the confluence page?
>>
>> Many thanks
>> Asma
>> Ph.D. in Big Data - Applied Machine Learning
>>
>> Le lun. 13 mars 2023 à 17:21, Mich Talebzadeh 
>> a écrit :
>>
>>> Apologies I missed the list.
>>>
>>> To move forward I selected these topics from the thread "Online classes
>>> for spark topics".
>>>
>>> To take this further I propose a confluence page to be set up.
>>>
>>>
>>>1. Spark UI
>>>2. Dynamic allocation
>>>3. Tuning of jobs
>>>4. Collecting spark metrics for monitoring and alerting
>>>5.  For those who prefer to use Pandas API on Spark since the
>>>release of Spark 3.2, what are some important notes for those users? For
>>>example, what are the additional factors affecting the Spark performance
>>>using Pandas API on Spark? How to tune them in addition to the 
>>> conventional
>>>Spark tuning methods applied to Spark SQL users.
>>>6. Spark internals and/or comparing spark 3 and 2
>>>7. Spark Streaming & Spark Structured Streaming
>>>8. Spark on notebooks
>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>10. Spark on k8s
>>>
>>> Opinions and how to is welcome
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> To move forward I selected these topics from the thread "Online classes
>>>> for spark topics".
>>>>
>>>> To take this further I propose a confluence page to be set up.
>>>>
>>>> Opinions and how to is welcome
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>
>>
>>


Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up
the infrastructure to broadcast if it helps?

On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani 
wrote:

> I am happy to be a part of this discussion as well.
>
> Regards,
> Neeraj
>
> On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:
>
>> +1, any webinar on Spark related topic is appreciated 
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> Le mer. 8 mars 2023 à 21:32, karan alang  a
>> écrit :
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I for myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regards
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might  be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in so to speak.
>>
>> I am sure there are many, but off the top of my head I can think of Holden
>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arrange online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts can share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences shared valuable.
>>
>> Respectfully,
>>
>> AK
>>
>>
>>
>>
>


Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group
 but
honestly the turnout was pretty low.  We had dived into various features.
If there are particular topics that you would like to discuss during a
live session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:

> +1
>
> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai  wrote:
>
>> +1, any webinar on Spark related topic is appreciated 
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> Le mer. 8 mars 2023 à 21:32, karan alang  a
>> écrit :
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I for myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regards
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might  be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in so to speak.
>>
>> I am sure there are many, but off the top of my head I can think of Holden
>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arrange online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts can share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences shared valuable.
>>
>> Respectfully,
>>
>> AK
>>
>>
>>
>>
>


Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja,

A little atypical way to respond to your question - please check out the
most recent Spark AMA where we discuss this:
https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share_medium=member_ios

HTH!
Denny



On Tue, Oct 25, 2022 at 09:16 Raja bhupati 
wrote:

> We have use case where we would like process Prometheus metrics data with
> spark
>
> On Tue, Oct 25, 2022, 19:49 Jacek Laskowski  wrote:
>
>> Hi Raj,
>>
>> Do you want to do the following?
>>
>> spark.read.format("prometheus").load...
>>
>> I haven't heard of such a data source / format before.
>>
>> What would you like it for?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, Oct 21, 2022 at 6:12 PM Raj ks  wrote:
>>
>>> Hi Team,
>>>
>>>
>>> We wanted to query Prometheus data with spark. Any suggestions will
>>> be appreciated
>>>
>>> Searched for documents but did not got any prompt one
>>>
>>


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan,

You may want to ping Databricks Help  or
Forums  as this is a Databricks
specific question.  I'm a little surprised that a Databricks cluster would
take a long time to create so it may be best to utilize these forums to
grok the cause.

HTH!
Denny


Sent via Superhuman 


On Mon, Aug 16, 2021 at 11:10 PM, karan alang  wrote:

> Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark
> development), and recently have had issues wherein the cluster
> takes a long time to get created, often timing out.
>
> Any ideas on how to resolve this ?
> Any other alternatives to databricks notebook ?
>


Re: Append to an existing Delta Lake using structured streaming

2021-07-21 Thread Denny Lee
Including the Delta Lake Users and Developers DL to help out.

That said, could you clarify how data is not being added?  By any chance
do you have any code samples to recreate this?
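
For context, a minimal sketch of what appending to an existing Delta table
with Structured Streaming can look like (the source, table path, and
checkpoint location below are illustrative assumptions, not the original
poster's job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("delta-append").getOrCreate()

    // Illustrative source only -- substitute the real streaming source (Kafka, files, ...).
    val stream = spark.readStream
      .format("rate")   // built-in test source generating (timestamp, value) rows
      .load()

    // Append to an existing Delta table path; each streaming query needs its own
    // checkpoint location so its progress is tracked independently.
    val query = stream.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/tmp/checkpoints/delta-append")  // assumed path
      .start("/tmp/delta/events")                                     // assumed existing table path

    query.awaitTermination()

In particular, reusing a checkpoint location from a different query, or a
schema mismatch with the existing table, are common reasons an append stream
runs without visibly adding rows.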

Sent via Superhuman 


On Wed, Jul 21, 2021 at 2:49 AM,  wrote:

> Hi all,
>   I stumbled upon an interesting problem. I have an existing Delta Lake
> with data recovered from a backup and would like to append to this
> Delta Lake using Spark structured streaming. This does not work. Although
> the streaming job is running, no data is appended.
> If I created the original file with structured streaming, then appending to
> this file with a streaming job (at least with the same job) works
> flawlessly.  Did I misunderstand something here?
>
> best regards
>Eugen Wintersberger
>


Re: How to unsubscribe

2020-05-06 Thread Denny Lee
Hi Fred,

To unsubscribe, could you please email: user-unsubscr...@spark.apache.org
(for more information, please refer to
https://spark.apache.org/community.html).

Thanks!
Denny


On Wed, May 6, 2020 at 10:12 AM Fred Liu  wrote:

> Hi guys
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
> *From:* Fred Liu 
> *Sent:* Wednesday, May 6, 2020 10:10 AM
> *To:* user@spark.apache.org
> *Subject:* Unsubscribe
>
>
>
> *[External E-mail]*
>
> *CAUTION: This email originated from outside the organization. Do not
> click links or open attachments unless you recognize the sender and know
> the content is safe.*
>
>
>
>
>


Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Denny Lee
There are a number of really good datasets already available including (but
not limited to):
- South Korea COVID-19 Dataset

- 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns
Hopkins CSSE 
- COVID-19 Open Research Dataset Challenge (CORD-19)


BTW, I had co-presented in a recent tech talk on Analyzing COVID-19: Can
the Data Community Help? 

In the US, there is a good resource Coronavirus in the United States:
Mapping the COVID-19 outbreak
 and
there are various global starter projects on Reddit's r/CovidProjects
.

There are a lot of good projects that we can all help individually or
together.  I would suggest seeing which hospitals/academic institutions
are doing analysis in your local region.  Even if you're analyzing public
worldwide data, how it acts in your local region will often be different.







On Thu, Mar 26, 2020 at 12:30 PM Rajev Agarwal 
wrote:

> Actually I thought these sites already exist; look at Johns Hopkins and
> worldometers.
>
> On Thu, Mar 26, 2020, 2:27 PM Zahid Rahman  wrote:
>
>>
>> "We can then donate this to WHO or others and we can make it very modular
>> though microservices etc."
>>
>> I have no interest because there are 8 million muslims locked up in their
>> home for 8 months by the Hindutwa (Indians)
>> You didn't take any notice of them.
>> Now you are locked up in your home and you want to contribute to the WHO.
>> The same WHO and you who didn't take any notice of the 8 million Kashmiri
>> Muslims.
>> The daily rapes of women and the imprisonment and torture of  men.
>>
>> Indian is the most dangerous country for women.
>>
>>
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>> On Thu, 26 Mar 2020 at 14:53, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks but nobody claimed we can fix it. However, we can all contribute
>>> to it. When it utilizes the cloud then it become a global digitization
>>> issue.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 26 Mar 2020 at 14:43, Laurent Bastien Corbeil <
>>> bastiencorb...@gmail.com> wrote:
>>>
 People in tech should be more humble and admit this is not something
 they can fix. There's already plenty of visualizations, dashboards etc
 showing the spread of the virus. This is not even a big data problem, so
 Spark would have limited use.

 On Thu, Mar 26, 2020 at 10:37 AM Sol Rodriguez 
 wrote:

> IMO it's not about technology, it's about data... if we don't have
> access to the data there's no point throwing "microservices" and "kafka" 
> at
> the problem. You might find that the most effective analysis might be
> delivered through an excel sheet ;)
> So before technology I'd suggest to get access to sources and then
> figure out how to best exploit them and deliver the information to the
> right people
>
> On Thu, Mar 26, 2020 at 2:29 PM Chenguang He 
> wrote:
>
>> Have you taken a look at this (
>> https://coronavirus.1point3acres.com/en/test  )?
>>
>> They have a visualizer with a very basic analysis of the outbreak.
>>
>> On Thu, Mar 26, 2020 at 8:54 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks.
>>>
>>> Agreed, computers are not the end but means to an end. We all have
>>> to start from somewhere. It all helps.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>> for any loss, damage or destruction of data or any other property which 
>>> may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author 

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1

On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:

> +1
>
> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>
>> +1 and the draft sounds good
>>
>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>
>>> Here is the draft announcement:
>>>
>>> ===
>>> Plan for dropping Python 2 support
>>>
>>> As many of you already know, the Python core development team and many
>>> widely used Python packages like Pandas and NumPy will drop Python 2 support
>>> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>>> since Spark 1.4 release in 2015. However, maintaining Python 2/3
>>> compatibility is an increasing burden and it essentially limits the use of
>>> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
>>> coming, we plan to eventually drop Python 2 support as well. The current
>>> plan is as follows:
>>>
>>> * In the next major release in 2019, we will deprecate Python 2 support.
>>> PySpark users will see a deprecation warning if Python 2 is used. We will
>>> publish a migration guide for PySpark users to migrate to Python 3.
>>> * We will drop Python 2 support in a future release in 2020, after
>>> Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
>>> used.
>>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>>> might not take patches that are specific to Python 2.
>>> ===
>>>
>>> Sean helped make a pass. If it looks good, I'm going to upload it to
>>> Spark website and announce it here. Let me know if you think we should do a
>>> VOTE instead.
>>>
>>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>>
 I created https://issues.apache.org/jira/browse/SPARK-27884 to track
 the work.

 On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
 wrote:

> We don’t usually reference a future release on website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when is Spark 3.0 coming out then.
> Might need to provide some clarity on that.
>

 We can say the "next major release in 2019" instead of Spark 3.0. Spark
 3.0 timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp 
> wrote:
>
>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to
>> continue support for python2.7 for spark 2.x as the feature sets won't be
>> expanding.
>>
>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
 On the development side, in a future release that drops Python 2
 support we can remove code that maintains python 2/3 compatibility and
 start using python 3 only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
Note the --packages option works for both PySpark and Spark (Scala).  For
the SparkLauncher class, you should be able to include packages ala:

spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")
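
For reference, a minimal SparkLauncher sketch of the same idea (the jar path
and main class below are hypothetical placeholders; only the --packages
coordinate comes from the thread):

    import org.apache.spark.launcher.SparkLauncher

    object LaunchWithGraphFrames {
      def main(args: Array[String]): Unit = {
        val launcher = new SparkLauncher()
          .setAppResource("/path/to/your-app.jar")    // hypothetical application jar
          .setMainClass("com.example.YourApp")        // hypothetical main class
          .setMaster("local[*]")
          // equivalent of passing --packages on the spark-submit command line
          .addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")

        val process = launcher.launch()   // starts a child spark-submit process
        process.waitFor()
      }
    }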


On Sun, Feb 18, 2018 at 3:30 PM xiaobo <guxiaobo1...@qq.com> wrote:

> Hi Denny,
> The pyspark script uses the --packages option to load graphframe library,
> what about the SparkLauncher class?
>
>
>
> ------ Original ------
> *From:* Denny Lee <denny.g@gmail.com>
> *Date:* Sun,Feb 18,2018 11:07 AM
> *To:* 94035420 <guxiaobo1...@qq.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: Does Pyspark Support Graphx?
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420 <guxiaobo1...@qq.com> wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
Most likely not as most of the effort is currently on GraphFrames  - a
great blog post on what GraphFrames offers can be found at:
https://databricks.com/blog/2016/03/03/introducing-graphframes.html.   Is
there a particular scenario or situation that you're addressing that
requires GraphX vs. GraphFrames?
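
For anyone landing on this thread later, a minimal GraphFrames sketch in
Scala (it assumes the graphframes package is on the classpath, e.g. via
--packages as discussed elsewhere in this list; the toy data is made up):

    import org.apache.spark.sql.SparkSession
    import org.graphframes.GraphFrame

    val spark = SparkSession.builder().appName("graphframes-example").getOrCreate()
    import spark.implicits._

    // Vertices need an "id" column; edges need "src" and "dst" columns.
    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "c", "follows")).toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)
    g.inDegrees.show()   // number of incoming edges per vertex
    g.pageRank.resetProbability(0.15).maxIter(10).run().vertices.show()

The same API is available from PySpark via the graphframes Python package,
which is largely why it is the suggested route over GraphX here.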

On Sat, Feb 17, 2018 at 8:26 PM xiaobo <guxiaobo1...@qq.com> wrote:

> Thanks Denny, will it be supported in the near future?
>
>
>
> ------ Original ------
> *From:* Denny Lee <denny.g@gmail.com>
> *Date:* Sun,Feb 18,2018 11:05 AM
> *To:* 94035420 <guxiaobo1...@qq.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: Does Pyspark Support Graphx?
>
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420 <guxiaobo1...@qq.com> wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though as it does support PySpark.
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

> I can not find anything for graphx module in the python API document, does
> it mean it is not supported yet?
>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :)

On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
wrote:

> That's great!
>
>
>
> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>
>> Awesome! Congrats!!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>> *To:* user@spark.apache.org
>> *Subject:* With 2.2.0 PySpark is now available for pip install from PyPI
>> :)
>>
>> Hi wonderful Python + Spark folks,
>>
>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>> been a long time coming (previous releases included pip installable
>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>> if you (or your friends) want to be able to work with PySpark locally on
>> your laptop you've got an easier path getting started (pip install pyspark).
>>
>> If you are setting up a standalone cluster your cluster will still need
>> the "full" Spark packaging, but the pip installed PySpark should be able to
>> work with YARN or an existing standalone cluster installation (of the same
>> version).
>>
>> Happy Sparking y'all!
>>
>> Holden :)
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Spark Shell issue on HDInsight

2017-05-14 Thread Denny Lee
Sorry for the delay, you just did as I'm with the Azure CosmosDB (formerly
DocumentDB) team.  If you'd like to make it official, why not add an issue
to the GitHub repo at https://github.com/Azure/azure-documentdb-spark/issues.
HTH!

On Thu, May 11, 2017 at 9:08 PM ayan guha <guha.a...@gmail.com> wrote:

> Works for me too ... you are a life-saver :)
>
> But the question: should we / how do we report this to the Azure team?
>
> On Fri, May 12, 2017 at 10:32 AM, Denny Lee <denny.g@gmail.com> wrote:
>
>> I was able to repro your issue when I had downloaded the jars via blob
>> but when I downloaded them as raw, I was able to get everything up and
>> running.  For example:
>>
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> resulted in the error:
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> [init] error: error while loading , Error accessing
>> /home/sshuser/jars/test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>
>> Failed to initialize compiler: object java.lang.Object in compiler mirror
>> not found.
>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>> ** object programmatically, settings.usejavacp.value = true.
>>
>> But when running:
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> it was up and running:
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> 17/05/11 22:54:06 WARN SparkContext: Use an existing SparkContext, some
>> configuration may not take effect.
>> Spark context Web UI available at http://10.0.0.22:4040
>> Spark context available as 'sc' (master = yarn, app id =
>> application_1494248502247_0013).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2.2.5.4.0-121
>>   /_/
>>
>> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala>
>>
>> HTH!
>>
>>
>> On Wed, May 10, 2017 at 11:49 PM ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Thanks for reply, but unfortunately did not work. I am getting same
>>> error.
>>>
>>> sshuser@ed0-svochd:~/azure-spark-docdb-test$ spark-shell --jars
>>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatica

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
ala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sshuser@ed0-svochd:~/azure-spark-docdb-test$
>
>
> On Mon, May 8, 2017 at 11:50 PM, Denny Lee <denny.g@gmail.com> wrote:
>
>> This appears to be an issue with the Spark to DocumentDB connector,
>> specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
>> see if you're still getting the same error?  i.e.
>>
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>>
>> On Mon, May 8, 2017 at 5:01 AM ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am facing an issue while trying to use azure-document-db connector
>>> from Microsoft. Instructions/Github
>>> <https://github.com/Azure/azure-documentdb-spark/wiki/Azure-DocumentDB-Spark-Connector-User-Guide>
>>> .
>>>
>>> Error while trying to add jar in spark-shell:
>>>
>>> spark-shell --jars
>>> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>> Exception in thread "main" java.lang.NullPointerException
>>> at
>>> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
>>>

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector,
specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
see if you're still getting the same error?  i.e.

spark-shell --master yarn --jars
azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar


On Mon, May 8, 2017 at 5:01 AM ayan guha  wrote:

> Hi
>
> I am facing an issue while trying to use azure-document-db connector from
> Microsoft. Instructions/Github
> 
> .
>
> Error while trying to add jar in spark-shell:
>
> spark-shell --jars
> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
> SPARK_MAJOR_VERSION is set to 2, using Spark2
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> [init] error: error while loading , Error accessing
> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.NullPointerException
> at
> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
> at
> scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896)
> at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336)
> at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> 

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to
DocumentDB (https://github.com/Azure/azure-documentdb-spark) if sticking
with Scala?
On Thu, Apr 20, 2017 at 21:46 Nan Zhu  wrote:

> DocDB does have a java client? Anything prevent you using that?
>
> Get Outlook for iOS 
> --
> *From:* ayan guha 
> *Sent:* Thursday, April 20, 2017 9:24:03 PM
> *To:* Ashish Singh
> *Cc:* user
> *Subject:* Re: Azure Event Hub with Pyspark
>
> Hi
>
> yes, its only scala. I am looking for a pyspark version, as i want to
> write to documentDB which has good python integration.
>
> Thanks in advance
>
> best
> Ayan
>
> On Fri, Apr 21, 2017 at 2:02 PM, Ashish Singh 
> wrote:
>
>> Hi ,
>>
>> You can try https://github.com/hdinsight/spark-eventhubs : which is
>> eventhub receiver for spark streaming
>> We are using it but you have scala version only i guess
>>
>>
>> Thanks,
>> Ashish Singh
>>
>> On Fri, Apr 21, 2017 at 9:19 AM, ayan guha  wrote:
>>
>>>
>>> Hi
>>>
>>> I am not able to find any conector to be used to connect spark streaming
>>> with Azure Event Hub, using pyspark.
>>>
>>> Does anyone know if there is such library/package exists>?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause
for Spark 2.x?

Thanks!


Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com
wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com 
cont...@chrissmurphy.com
wrote:
PLEASE!!

Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should aim for larger file sizes due to the
overhead of opening up files.  Typical guidance is between 64MB-1GB;
personally I usually stick with 128MB-512MB with the default snappy
codec compression with parquet.  A good reference is Vida Ha's
presentation, Data Storage Tips for Optimal Spark Performance.
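
As a rough sketch of what that looks like in practice (the DataFrame,
partition count, and output path below are illustrative - tune the
partition count so each output file lands in that 128MB-512MB range):

  // coalesce to fewer partitions before writing so Spark emits fewer,
  // larger parquet files instead of thousands of tiny ones
  val targetPartitions = 16
  df.coalesce(targetPartitions)
    .write
    .option("compression", "snappy")
    .parquet("/data/output/events")

If coalescing starves the upstream stages of parallelism, a repartition
before the write does the same job at the cost of a shuffle.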


On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:

> Hi Everyone,
> Does anyone know what is the best practise of writing parquet file from
> Spark ?
>
> As Spark app write data to parquet and it shows that under that directory
> there are heaps of very small parquet file (such as
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only
> 15KB
>
> Should it write each chunk of  bigger data size (such as 128 MB) with
> proper number of files ?
>
> Does anyone find out any performance changes when changing data size of
> each parquet file ?
>
> Thanks,
> Kevin.
>


Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good
starting point is the Apache Spark Documentation at:
http://spark.apache.org/documentation.html


The two books that immediately come to mind are

- Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do
(there's also a Chinese language version of this book)

- Advanced Analytics with Apache Spark:
http://shop.oreilly.com/product/mobile/0636920035091.do

You can also find a pretty decent listing of Apache Spark resources at:
https://sparkhub.databricks.com/resources/

HTH!


On Sun, Nov 6, 2016 at 19:00 litg <1933443...@qq.com> wrote:

>I'm a postgraduate from  Shanghai Jiao Tong University,China.
> recently, I
> carry out a project about the  realization of artificial algorithms on
> spark
> in python. however, I am not familiar with this field.furthermore,there are
> few Chinese books about spark.
>  Actually,I strongly want to have a further study at this field.hope
> someone can  kindly recommend me some books about  the mechanism of spark,
> or just give me suggestions about how to  program with spark.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hope-someone-can-recommend-some-books-for-me-a-spark-beginner-tp28033.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache
Spark at
https://www.edx.org/xseries/data-science-engineering-apacher-sparktm.

Note, a great quick start is the Getting Started with Apache Spark on
Databricks at https://databricks.com/product/getting-started-guide

HTH!

On Sun, Nov 6, 2016 at 22:20 Raghav  wrote:

> Can you please point out the right courses from EDX/Berkeley ?
>
> Many thanks.
>
> On Sun, Nov 6, 2016 at 6:08 PM, ayan guha  wrote:
>
> I would start with Spark documentation, really. Then you would probably
> start with some older videos from youtube, especially spark summit
> 2014,2015 and 2016 videos. Regading practice, I would strongly suggest
> Databricks cloud (or download prebuilt from spark site). You can also take
> courses from EDX/Berkley, which are very good starter courses.
>
> On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
>
> I am newbie in the world of big data analytics, and I want to teach myself
> Apache Spark, and want to be able to write scripts to tinker with data.
>
> I have some understanding of Map Reduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance for starter material, or videos.
>
> Thanks.
>
> Raghav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that you can join to that table, presuming it's
small enough to distribute?  Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!
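
As a rough sketch - the table, column, and connection names below are
purely illustrative:

  import org.apache.spark.sql.functions.broadcast

  // read the small HANA lookup table in as a DataFrame, e.g. over JDBC
  val lookupDF = sqlContext.read.jdbc(jdbcUrl, "LOOKUP_TABLE", connProps)

  // hint Spark to broadcast the small side so the join happens map-side,
  // which is effectively what a broadcast variable would have bought you
  val joined = factDF.join(broadcast(lookupDF), Seq("lookup_key"))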


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit  wrote:

> I have a lookup table in HANA database. I want to create a spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as an data frame
> and convert data frame into broadcast variable?
>
> Thanks,
> Nishit
>


Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX or GraphFrames subgraphs to build up your
subgraph.  A good example for GraphFrames can be found at:
http://graphframes.github.io/user-guide.html#subgraphs.  HTH!
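
A rough sketch of the edge-filtered subgraph idea, assuming a GraphFrame g
whose edge DataFrame has a "relationship" column (names are illustrative):

  import org.graphframes.GraphFrame

  // keep only the peer-to-peer edges, then build a new graph from them
  val p2pEdges = g.edges.filter("relationship = 'p2p'")
  val p2pGraph = GraphFrame(g.vertices, p2pEdges)

  // run BFS within the filtered graph between two vertices of interest
  val paths = p2pGraph.bfs.fromExpr("id = 1").toExpr("id = 3").run()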

On Mon, Oct 10, 2016 at 9:32 PM cashinpj  wrote:

> Hello,
>
> I have a set of data representing various network connections.  Each vertex
> is represented by a single id, while the edges have  a source id,
> destination id, and a relationship (peer to peer, customer to provider, or
> provider to customer).  I am trying to create a sub graph build around a
> single source node following one type of edge as far as possible.
>
> For example:
> 1 2 p2p
> 2 3 p2p
> 2 3 c2p
>
> Following the p2p edges would give:
>
> 1 2 p2p
> 2 3 p2p
>
> I am pretty new to GraphX and GraphFrames, but was wondering if it is
> possible to get this behavior using the GraphFrames bfs() function or would
> it be better to modify the already existing Pregel implementation of bfs?
>
> Thank you for your time.
>
> Padraic
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-BFS-tp27876.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya,

Here's a blog post concerning On-Time Flight Performance with GraphFrames:
https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html

It also includes a Databricks notebook that has the code in it.

HTH!
Denny


On Tue, Aug 2, 2016 at 1:16 AM Kazuaki Ishizaki  wrote:

> Sorry
> Please ignore this mail. Sorry for misinterpretation of GraphFrame in
> Spark. I thought that Frame Graph for profiling tool.
>
> Kazuaki Ishizaki,
>
>
>
> From:Kazuaki Ishizaki/Japan/IBM@IBMJP
> To:Divya Gehlot 
> Cc:"user @spark" 
> Date:2016/08/02 17:06
> Subject:Re: Spark GraphFrames
> --
>
>
>
> Hi,
> Kay wrote a procedure to use GraphFrames with Spark.
> *https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4*
> 
>
> Kazuaki Ishizaki
>
>
>
> From:Divya Gehlot 
> To:"user @spark" 
> Date:2016/08/02 14:52
> Subject:Spark GraphFrames
> --
>
>
>
> Hi,
>
> Has anybody has worked with GraphFrames.
> Pls let me know as I need to know the real case scenarios where It can
> used .
>
>
> Thanks,
> Divya
>
>
>


Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico,

Glad to hear that you love Spark and would like to organize a meetup in
Rome. We created a Meetup-in-a-box to help with that - check out the post
https://databricks.com/blog/2015/11/19/meetup-in-a-box.html.

HTH!
Denny



On Fri, Feb 19, 2016 at 02:38 Domenico Pontari 
wrote:

>
> Hi guys,
> I spent till September 2015 in the bay area working with Spark and I love
> it. Now I'm back to Rome and I'd like to organize a meetup about it and Big
> Data in general. Any idea / suggestions? Can you eventually sponsor beers
> and pizza for it?
> Best,
> Domenico
>


Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html:

For Python, you can use the --py-files argument of spark-submit to add .py,
.zip or .egg files to be distributed with your application. If you depend
on multiple Python files we recommend packaging them into a .zip or .egg.
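
For example - a hypothetical layout, with the application's dependencies
zipped up as deps.zip:

  zip -r deps.zip mypackage/
  ./bin/spark-submit --master yarn --py-files deps.zip main.py

Inside main.py you can then import mypackage as usual.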



On Fri, Jan 8, 2016 at 6:44 PM Ascot Moss  wrote:

> Hi,
>
> Instead of using Spark-shell, does anyone know how to build .zip (or .egg)
> for Python and use Spark-submit to run?
>
> Regards
>


Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join
the mailing list.


On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele 
wrote:

>
>


Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using
model = LinearRegressionWithSGD.train(parsedData, iterations=100,
step=0.01, intercept=True)

then to get the intercept, you would use
model.intercept

More information can be found at:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression

HTH!


On Tue, Dec 15, 2015 at 11:06 PM Arunkumar Pillai 
wrote:

>
> How to get intercept in  Linear Regression Model?
>
> LinearRegressionWithSGD.train(parsedData, numIterations)
>
> --
> Thanks and Regards
> Arun
>


Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark
(https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/)

On Mon, Nov 2, 2015 at 05:27 Stefano Baghino 
wrote:

> There is this interesting book from Databricks:
> https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
>
> What do you think? Does it contain the info you're looking for? :)
>
> On Mon, Nov 2, 2015 at 2:18 PM, satish chandra j  > wrote:
>
>> HI All,
>> Yes, any such doc will be a great help!!!
>>
>>
>>
>> On Fri, Oct 30, 2015 at 4:35 PM, huangzheng <1106944...@qq.com> wrote:
>>
>>> I have the same question.anyone help us.
>>>
>>>
>>> -- Original Message --
>>> *From:* "Deepak Sharma";
>>> *Sent:* Friday, October 30, 2015, 7:23 PM
>>> *To:* "user";
>>> *Subject:* Best practises
>>>
>>> Hi
>>> I am looking for any blog / doc on the developer's best practices if
>>> using Spark .I have already looked at the tuning guide on
>>> spark.apache.org.
>>> Please do let me know if any one is aware of any such resource.
>>>
>>> Thanks
>>> Deepak
>>>
>>
>>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>


Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey
2015.  Because of the survey, we have a better picture of who’s using
Spark, how they’re using it, and what they’re using it to build–insights
that will guide major updates to the Spark platform as we move into Spark’s
next phase of growth. The results are summarized in an infographic:
Spark Survey Results 2015.
Thank you to everyone who participated in Spark Survey 2015 and for your
help in shaping Spark’s future!


Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on
the topic at
http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded

If you have an archival type strategy, you could do daily BCP extracts out
to load the data into HDFS / S3 / etc. This would result in minimal impact
to SQL Server for the extracts (for that scenario, that was of primary
importance).

On Thu, Jul 23, 2015 at 16:42 vinod kumar vinodsachin...@gmail.com wrote:

 Hi Everyone,

 I am in need to use the table from MsSQLSERVER in SPARK.Any one please
 share me the optimized way for that?

 Thanks in advance,
 Vinod




Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
I went ahead and tested your file and the results from the tests can be
seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.

Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
the query ran without any issues.  I was able to recreate the issue with
{Java 7, default}.  I included the commands I used to start the spark-shell
but basically I just used all defaults (no alteration to driver or executor
memory) with the only additional call was with driver-class-path to connect
to MySQL Hive metastore.  This is on OSX Macbook Pro.

One thing I did notice is that your version of Java 7 is version 51 while
my version of Java 7 is version 79.  Could you see if updating to Java 7
version 79 perhaps allows you to use the MaxPermSize call?




On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov s...@swoop.com wrote:

  The file is at
 https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1

  The command was included in the gist

  SPARK_REPL_OPTS=-XX:MaxPermSize=256m
 spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
 com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Monday, July 6, 2015 at 12:59 AM
 To: Simeon Simeonov s...@swoop.com
 Cc: Denny Lee denny.g@gmail.com, Andy Huang 
 andy.hu...@servian.com.au, user user@spark.apache.org

 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   I have never seen issue like this. Setting PermGen size to 256m should
 solve the problem. Can you send me your test file and the command used to
 launch the spark shell or your application?

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:

   Yin,

  With 512Mb PermGen, the process still hung and had to be kill -9ed.

  At 1Gb the spark shell  associated processes stopped hanging and
 started exiting with

  scala println(dfCount.first.getLong(0))
 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
 called with curMem=0, maxMem=2223023063
 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 229.5 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184) called
 with curMem=235040, maxMem=2223023063
 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
 in memory on localhost:65464 (size: 19.7 KB, free: 2.1 GB)
 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from first
 at console:30
 java.lang.OutOfMemoryError: PermGen space
 Stopping spark context.
 Exception in thread main
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread main
 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
 broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
 GB)

  That did not change up until 4Gb of PermGen space and 8Gb for driver 
 executor each.

  I stopped at this point because the exercise started looking silly. It
 is clear that 1.4.0 is using memory in a substantially different manner.

  I'd be happy to share the test file so you can reproduce this in your
 own environment.

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Sunday, July 5, 2015 at 11:04 PM
 To: Denny Lee denny.g@gmail.com
 Cc: Andy Huang andy.hu...@servian.com.au, Simeon Simeonov 
 s...@swoop.com, user user@spark.apache.org
 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   Sim,

  Can you increase the PermGen size? Please let me know what is your
 setting when the problem disappears.

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:

  I had run into the same problem where everything was working
 swimmingly with Spark 1.3.1.  When I switched to Spark 1.4, either by
 upgrading to Java8 (from Java7) or by knocking up the PermGenSize had
 solved my issue.  HTH!



  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
 wrote:

 We have hit the same issue in spark shell when registering a temp
 table. We observed it happening with those who had JDK 6. The problem went
 away after installing jdk 8. This was only for the tutorial materials which
 was about loading a parquet file.

  Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
  Sent from

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context
is primarily about very fast queries.  If you want to use real-time
queries, I would utilize Spark Streaming.  A couple of great resources on
this topic include Guest Lecture on Spark Streaming in Stanford CME 323:
Distributed Algorithms and Optimization
http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford
and Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
(from the recent Spark Summit 2015)

HTH!


On Mon, Jul 6, 2015 at 3:23 PM spierki florian.spierc...@crisalid.com
wrote:

 Hello,

 I'm actually asking my self about performance of using Spark SQL with Hive
 to do real time analytics.
 I know that Hive has been created for batch processing, and Spark is use to
 do fast queries.

 But, use Spark SQL with Hive will allow me to do real time queries ? Or it
 just will make fastest queries but not real time.
 Should I use an other datawarehouse, like Hbase ?

 Thanks in advance for your time and consideration,
 Florian



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean,
Sure, will take care of this.
HTH,
Denny

On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote:

 Here's our home page: http://www.meetup.com/Chicago-Spark-Users/

 Thanks,
 Dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem where everything was working swimmingly
with Spark 1.3.1.  When I switched to Spark 1.4, either upgrading to
Java 8 (from Java 7) or bumping up the PermGen size solved my issue.
HTH!



On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au wrote:

 We have hit the same issue in spark shell when registering a temp table.
 We observed it happening with those who had JDK 6. The problem went away
 after installing jdk 8. This was only for the tutorial materials which was
 about loading a parquet file.

 Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
 f: 02 9376 0730| m: 0433221979



Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the hive skew flag will be
included within SparkSQL?

Thanks!
Denny


Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive
 Language Manual DML - Delete
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete)
while Spark 1.3 defaults to Hive 0.13.  Perhaps rebuild Spark with Hive
0.14 or generate a new table filtering out the values you do not want.
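
A rough sketch of the second approach from a HiveContext - the table and
column names are illustrative:

  // rewrite the table minus the rows you want "deleted"
  sqlContext.sql(
    "CREATE TABLE my_table_filtered AS " +
    "SELECT * FROM my_table WHERE name <> 'xxx'")

Once you have validated the result, swap the filtered table in for the
original (or drop and rename).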

On Thu, May 14, 2015 at 3:26 AM luohui20...@sina.com wrote:

 Hi guys

i got to delete some data from a table by delete from table where
 name = xxx, however delete is not functioning like the DML operation in
 hive.  I got a info like below:

 Usage: delete [FILE|JAR|ARCHIVE] value [value]*

 15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor: Usage: delete
 [FILE|JAR|ARCHIVE] value [value]*



I checked the list of Supported Hive Features , but not found if
 this dml is supported.

So any comments will be appreciated.

 

 Thanks & Best regards!
 San.Luo



Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do
the automation - it's a bit of work to set up, but well worth the effort.

On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote:

 It's mostly manual. You could try automating with something like Chef, of
 course, but there's nothing already available in terms of automation.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Fri, Apr 24, 2015 at 10:33 AM, James King jakwebin...@gmail.com
 wrote:

 Thanks Dean,

 Sure I have that setup locally and testing it with ZK.

 But to start my multiple Masters do I need to go to each host and start
 there or is there a better way to do this.

 Regards
 jk

 On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 The convention for standalone cluster is to use Zookeeper to manage
 master failover.

 http://spark.apache.org/docs/latest/spark-standalone.html

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Fri, Apr 24, 2015 at 5:01 AM, James King jakwebin...@gmail.com
 wrote:

 I'm trying to find out how to setup a resilient Spark cluster.

 Things I'm thinking about include:

 - How to start multiple masters on different hosts?
 - there isn't a conf/masters file from what I can see


 Thank you.







Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself.  For example, my own Thrift
start command is in the form:

./sbin/start-thriftserver.sh --master spark://$myserver:7077
--driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host
$myserver --hiveconf hive.server2.thrift.port 1

HTH!


On Wed, Apr 22, 2015 at 5:27 AM Yiannis Gkoufas johngou...@gmail.com
wrote:

 Hi Himanshu,

 I am using:

 ./start-thriftserver.sh --master spark://localhost:7077

 Do I need to specify something additional to the command?

 Thanks!

 On 22 April 2015 at 13:14, Himanshu Parashar himanshu.paras...@gmail.com
 wrote:

 what command are you using to start the Thrift server?

 On Wed, Apr 22, 2015 at 3:52 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi all,

 I am trying to start the thriftserver and I get some errors.
 I have hive running and placed hive-site.xml under the conf directory.
 From the logs I can see that the error is:

 Call From localhost to localhost:54310 failed

 I am assuming that it tries to connect to the wrong port for the
 namenode, which in my case its running on 9000 instead of 54310

 Any help would be really appreciated.

 Thanks a lot!




 --
 [HiM]





Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :)
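
Put another way, a stage gets skipped when the output it would produce is
already available from an earlier action.  A small sketch (the input path
is hypothetical):

  val counts = sc.textFile("/data/words.txt")
    .flatMap(_.split(" "))
    .map(w => (w, 1))
    .reduceByKey(_ + _)

  counts.count()    // first action: all stages run
  counts.collect()  // earlier stages show as skipped - shuffle output is reused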

On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com
wrote:

 Almost.  Jobs don't get skipped.  Stages and Tasks do if the needed
 results are already available.

 On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:

 The job is skipped because the results are available in memory from a
 prior run.  More info at:
 http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E.
 HTH!

 On Sun, Apr 19, 2015 at 1:43 PM James King jakwebin...@gmail.com wrote:

 In the web ui i can see some jobs as 'skipped' what does that mean? why
 are these jobs skipped? do they ever get executed?

 Regards
 jk





Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please
refer to SPARK-4226.

BTW, Spark 1.3 by default binds to Hive 0.13.1.
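
In the meantime, a common workaround is to rewrite the NOT IN predicate as
an outer join - a sketch using the column names from the query below:

  SELECT DISTINCT t.OutSwitchID
  FROM wtbECRTemp t
  LEFT OUTER JOIN tmpCDRSwitchIDs s
    ON t.OutSwitchID = s.SwitchID
  WHERE s.SwitchID IS NULL

Note this is not strictly identical to NOT IN when SwitchID contains
NULLs, so check your data for that case.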




On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote:

 So I'm trying to store the results of a query into a DataFrame, but I get
 the
 following exception thrown:

 Exception in thread main java.lang.RuntimeException: [1.71] failure:
 ``*''
 expected but `select' found

 SELECT DISTINCT OutSwitchID FROM wtbECRTemp WHERE OutSwtichID NOT IN
 (SELECT
 SwitchID FROM tmpCDRSwitchIDs)

 And it has a ^ pointing to the second SELECT. But according to this
 (
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
 ),
 subqueries should be supported with Hive 0.13.0.

 So which version is Spark using? And if subqueries are not currently
 supported, what would be a suitable alternative to this?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Which-version-of-Hive-QL-is-Spark-1-3-0-using-tp22542.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use the classpath.first or
perhaps copy the jar to the slaves, could that actually do the trick?  The
latter isn't really all that efficient, but I'm curious whether it would
work.


On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:

 I take it back. My solution only works when you set the master to local.
 I
 get the same error when I try to run it on the cluster.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22525.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing this in Scala per se - then you can probably just reference
JodaTime or the Java Date / Time classes.  If you are using SparkSQL, then
you can use the various Hive date functions for conversion.
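
For example, a small sketch using plain java.text.SimpleDateFormat to
normalize the second file's format to the first (assuming English month
abbreviations):

  import java.text.SimpleDateFormat
  import java.util.Locale

  val inFmt  = new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH)
  val outFmt = new SimpleDateFormat("yyyy-MM-dd")
  val normalized = outFmt.format(inFmt.parse("02-OCT-12"))  // "2012-10-02"

You could apply that conversion inside a map over the second file before
registering it as a table, so the WHERE clause compares like with like.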

On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote:

 I need some help to convert the date pattern in my Scala code for Spark
 1.3. I am reading the dates from two flat files having two different date
 formats.

 File 1:
 2015-03-27

 File 2:
 02-OCT-12
 09-MAR-13

 This format of file 2 is not being recognized by my Spark SQL when I am
 comparing it in a WHERE clause on the date fields. Format of file 1 is
 being recognized better. How to convert the format in file 2 to match with
 the format in file 1?

 Regards
 Ananda

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to
Hive 0.12 if you specify it in the profile when building Spark as per
https://spark.apache.org/docs/1.3.0/building-spark.html.

If you are downloading a pre built version of Spark 1.3 - then by default,
it is set to Hive 0.13.1.

HTH!

On Thu, Apr 9, 2015 at 10:03 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Most likely you have an existing Hive installation with data in it. In
 this case i was not able to get Spark 1.3 communicate with existing Hive
 meta store. Hence when i read any table created in hive, Spark SQL used to
 complain Data table not found

 If you get it working, please share the steps.

 On Thu, Apr 9, 2015 at 9:25 PM, Arthur Chan arthur.hk.c...@gmail.com
 wrote:

 Hi,

 I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark
 1.3.x

 Could anyone advise which Hive version should be used to match Spark
 1.3.x?
 Can I use Hive 1.1.0 for Spark 1.3? or can I use Hive 0.14 for Spark 1.3?

 Regards
 Arthur




 --
 Deepak




Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive?  If you're getting the
same error within Hive, it sounds like a permissions issue as per Bojan.
More info can be found at:
http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error


On Thu, Apr 9, 2015 at 7:31 AM Bojan Kostic blood9ra...@gmail.com wrote:

 I think it uses local dir, hdfs dir path starts with hdfs://

 Check permissions on folders, and also check logs. There should be more
 info
 about exception.

 Best
 Bojan



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-
 tp22435p22439.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
At this time, the JDBC Data source is not extensible so it cannot support
SQL Server.  There were some thoughts - credit to Cheng Lian for this -
about making the JDBC data source extensible for third-party support,
possibly via Slick.


On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com wrote:

 Hi, I am trying to pull data from ms-sql server. I have tried using the
 spark.sql.jdbc

 CREATE TEMPORARY TABLE c
 USING org.apache.spark.sql.jdbc
 OPTIONS (
 url jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;,
 dbtable Customer
 );

 But it shows java.sql.SQLException: No suitable driver found for
 jdbc:sqlserver

 I have jdbc drivers for mssql but i am not sure how to use them I provide
 the jars to the sql shell and then tried the following:

 CREATE TEMPORARY TABLE c
 USING com.microsoft.sqlserver.jdbc.SQLServerDriver
 OPTIONS (
 url jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;,
 dbtable Customer
 );

 But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
 class com.microsoft.sqlserver.jdbc.SQLServerDriver)

 Can anyone tell what is the proper way to connect to ms-sql server.
 Thanks






 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-
 sql-tp22399.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time MS SQL Server is not supported through the
JDBC data source.  In my environment, we've been using Hadoop
streaming to extract out data from multiple SQL Servers, pushing the data
into HDFS, creating the Hive tables and/or converting them into Parquet,
and then Spark can access them directly.   Due to my heavy use of SQL
Server, I've been thinking about seeing if I can help with the extension of
the JDBC data source so it can be supported - but alas, I haven't found the
time yet ;)

On Tue, Apr 7, 2015 at 6:52 AM ARose ashley.r...@telarix.com wrote:

 I am having the same issue with my java application.

 String url = jdbc:sqlserver:// + host + :1433;DatabaseName= +
 database + ;integratedSecurity=true;
 String driver = com.microsoft.sqlserver.jdbc.SQLServerDriver;

 SparkConf conf = new
 SparkConf().setAppName(appName).setMaster(master);
 JavaSparkContext sc = new JavaSparkContext(conf);
 SQLContext sqlContext = new SQLContext(sc);

 MapString, String options = new HashMap();
 options.put(driver, driver);
 options.put(url, url);
 options.put(dbtable, tbTableName);

 DataFrame jdbcDF = sqlContext.load(jdbc, options);
 jdbcDF.printSchema();
 jdbcDF.show();

 It prints the schema of the DataFrame just fine, but as soon as it tries to
 evaluate it for the show() call, I get a ClassNotFoundException for the
 driver. But the driver is definitely included as a dependency, so is  MS
 SQL
 Server just not supported?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-
 from-spark-sql-tp22399p22404.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
 Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin r...@databricks.com wrote:

 There is already an explode function on DataFrame btw


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

 I think something like this would work. You might need to play with the
 type.

 df.explode(arrayBufferColumn) { x = x }



 On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:

 Thanks Dean - fun hack :)

 On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com
 wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap{ case (date, array) = for (x - array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the = pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com
 wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com
 wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!








Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :)

On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap{ case (date, array) = for (x - array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the = pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for some
 reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!







ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of:

[2015-04, ArrayBuffer(A, B, C, D)]

and I'd like to return it as:

2015-04, A
2015-04, B
2015-04, C
2015-04, D

What's the best way to do this?

Thanks in advance!


Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Thanks Michael - that was it!  I was drawing a blank on this one for some
reason - much appreciated!


On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode shorthand
 directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!






Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :)

On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote:

 This is tracked by these JIRAs..

 https://issues.apache.org/jira/browse/SPARK-5947
 https://issues.apache.org/jira/browse/SPARK-5948

 --
 From: denny.g@gmail.com
 Date: Wed, 1 Apr 2015 04:35:08 +
 Subject: Creating Partitioned Parquet Tables via SparkSQL
 To: user@spark.apache.org


 Creating Parquet tables via .saveAsTable is great but was wondering if
 there was an equivalent way to create partitioned parquet tables.

 Thanks!




Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent,

This may be a case that you're missing a semi-colon after your CREATE
TEMPORARY TABLE statement.  I ran your original statement (missing the
semi-colon) and got the same error as you did.  As soon as I added it in, I
was good to go again:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path /samples/people.json
);
-- above needed a semi-colon so the temporary table could be created first
SELECT * FROM jsonTable;

HTH!
Denny


On Sun, Mar 29, 2015 at 6:59 AM Vincent He vincent.he.andr...@gmail.com
wrote:

 No luck, it does not work, anyone know whether there some special setting
 for spark-sql cli so we do not need to write code to use spark sql? Anyone
 have some simple example on this? appreciate any help. thanks in advance.

 On Sat, Mar 28, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 See
 https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

 I haven't tried the SQL statements in above blog myself.

 Cheers

 On Sat, Mar 28, 2015 at 5:39 AM, Vincent He vincent.he.andr...@gmail.com
  wrote:

 thanks for your information . I have read it, I can run sample with
 scala or python, but for spark-sql shell, I can not get an exmaple running
 successfully, can you give me an example I can run with ./bin/spark-sql
 without writing any code? thanks

 On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at
 https://spark.apache.org/docs/latest/sql-programming-guide.html

 Cheers



  On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com
 wrote:
 
 
  I am learning spark sql and try spark-sql example,  I running
 following code, but I got exception ERROR CliDriver:
 org.apache.spark.sql.AnalysisException: cannot recognize input near
 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17, I have two
 questions,
  1. Do we have a list of the statement supported in spark-sql ?
  2. Does spark-sql shell support hiveql ? If yes, how to set?
 
  The example I tried:
  CREATE TEMPORARY TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (
path examples/src/main/resources/people.json
  )
  SELECT * FROM jsonTable
  The exception I got,
   CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path examples/src/main/resources/people.json
)
SELECT * FROM jsonTable
;
  15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY
 TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (
path examples/src/main/resources/people.json
  )
  SELECT * FROM jsonTable
  NoViableAltException(241@[654:1: ddlStatement : (
 createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement |
 createTableStatement | dropTableStatement | truncateTableStatement |
 alterStatement | descStatement | showStatement | metastoreCheck |
 createViewStatement | dropViewStatement | createFunctionStatement |
 createMacroStatement | createIndexStatement | dropIndexStatement |
 dropFunctionStatement | dropMacroStatement | analyzeStatement |
 lockStatement | unlockStatement | lockDatabase | unlockDatabase |
 createRoleStatement | dropRoleStatement | grantPrivileges |
 revokePrivileges | showGrants | showRoleGrants | showRolePrincipals |
 showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
  at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
  at org.antlr.runtime.DFA.predict(DFA.java:144)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
  at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
  at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
  at
 org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
  at
 org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
  at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
  at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
  at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
  at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
  at
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
  at
 

Re: Hive Table not from from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that your Hive
metastore that you can connect to via Hive is a MySQL database?  And to
also confirm, when you're running spark-shell and doing a show tables
statement, you're getting the same error?


On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I tried the following

 1)

 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:
 *$SPARK_HOME/conf/hive-site.xml*  --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
 --num-executors 1 --driver-memory 4g --driver-java-options
 -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue
 hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
 spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


 This throws dw_bid not found. Looks like Spark SQL is unable to read my
 existing Hive metastore and creates its own and hence complains that table
 is not found.


 2)

 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
   --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:
 *$SPARK_HOME/conf/hive-site.xml* --num-executors 1 --driver-memory 4g
 --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g
 --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2

 This time i do not get above error, however i get MySQL driver not found
 exception. Looks like this is even before its able to communicate to Hive.

 Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
 the BONECP plugin to create a ConnectionPool gave an error : The
 specified datastore driver (com.mysql.jdbc.Driver) was not found in the
 CLASSPATH. Please check your CLASSPATH specification, and the name of the
 driver.

 In both above cases, i do have hive-site.xml in Spark/conf folder.

 3)
 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
   --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar--num-executors
 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G
 --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2

 I do not specify hive-site.xml in --jars or --driver-class-path. Its
 present in spark/conf folder as per
 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
 .

 In this case i get same error as #1. dw_bid table not found.

 I want Spark SQL to know that there are tables in Hive and read that data.
 As per guide it looks like Spark SQL has that support.

 Please suggest.

 Regards,
 Deepak


 On Thu, Mar 26, 2015 at 9:01 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Stack Trace:

 15/03/26 08:25:42 INFO ql.Driver: OK
 15/03/26 08:25:42 INFO log.PerfLogger: PERFLOG method=releaseLocks
 from=org.apache.hadoop.hive.ql.Driver
 15/03/26 08:25:42 INFO log.PerfLogger: /PERFLOG method=releaseLocks
 start=1427383542966 end=1427383542966 duration=0
 

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what
are you using?

The error you are seeing is common when the correct driver for connecting
Spark to the Hive metastore isn't on the classpath.

As well, I noticed that you're using SPARK_CLASSPATH which has been
deprecated.  Depending on your scenario, you may want to use --jars,
--driver-class-path, or extraClassPath.  A good thread on this topic can be
found at
http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E
.

For example, when I connect to my own Hive metastore via Spark 1.3, I
reference the --driver-class-path where in my case I am using MySQL as my
Hive metastore:

./bin/spark-sql --master spark://$standalone$:7077 --driver-class-path
mysql-connector-$version$.jar

HTH!


On Thu, Mar 26, 2015 at 8:09 PM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I do not use MySQL, i want to read Hive tables from Spark SQL and
 transform them in Spark SQL. Why do i need a MySQL driver ? If i still need
 it which version should i use.

 Assuming i need it, i downloaded the latest version of it from
 http://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.34 and
 ran the following commands, i do not see above exception , however i see a
 new one.





 export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
 export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
 export
 SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:
 */home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*
 export HADOOP_CONF_DIR=/apache/hadoop/conf
 cd $SPARK_HOME
 ./bin/spark-sql
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 ...
 ...

 spark-sql

 spark-sql

 spark-sql


 show tables;

 15/03/26 20:03:57 INFO metastore.HiveMetaStore: 0: get_tables: db=default
 pat=.*

 15/03/26 20:03:57 INFO HiveMetaStore.audit: ugi=dvasthi...@corp.ebay.com
 ip=unknown-ip-addr cmd=get_tables: db=default pat=.*

 15/03/26 20:03:58 INFO spark.SparkContext: Starting job: collect at
 SparkPlan.scala:83

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Got job 1 (collect at
 SparkPlan.scala:83) with 1 output partitions (allowLocal=false)

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Final stage: Stage
 1(collect at SparkPlan.scala:83)

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Missing parents: List()

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Submitting Stage 1
 (MapPartitionsRDD[3] at map at SparkPlan.scala:83), which has no missing
 parents

 15/03/26 20:03:58 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1

 15/03/26 20:03:58 INFO scheduler.StatsReportListener: Finished stage:
 org.apache.spark.scheduler.StageInfo@2bfd9c4d

 15/03/26 20:03:58 INFO scheduler.DAGScheduler: Job 1 failed: collect at
 SparkPlan.scala:83, took 0.005163 s

 15/03/26 20:03:58 ERROR thriftserver.SparkSQLDriver: Failed in [show
 tables]

 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 serialization failed: java.lang.reflect.InvocationTargetException

 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)


 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)


 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

 java.lang.reflect.Constructor.newInstance(Constructor.java:526)


 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)


 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)

 org.apache.spark.broadcast.TorrentBroadcast.org
 $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)


 org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:79)


 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)


 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)


 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)

 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)

 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:839)

 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)


 

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data
using hyperloglog in combination with Spark is atscale (http://atscale.com/).
It builds the aggregations and makes use of the speed of SparkSQL - all
within the context of a model that is accessible by Tableau or Qlik.
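
If you want to roll something similar by hand, here is a small sketch of
pre-aggregating with Spark SQL's approximate distinct count (HyperLogLog
under the covers) from a HiveContext - the DataFrame, table, and column
names are illustrative:

  import org.apache.spark.sql.functions.approxCountDistinct

  // build a much smaller daily rollup for the BI tool to query
  val daily = events.groupBy("event_date")
    .agg(approxCountDistinct("user_id").as("approx_uniques"))
  daily.saveAsTable("daily_uniques")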

On Thu, Mar 26, 2015 at 8:55 AM Jörn Franke jornfra...@gmail.com wrote:

 As I wrote previously - indexing is not your only choice, you can
 preaggregate data during load or depending on your needs you  need to think
 about other data structures, such as graphs, hyperloglog, bloom filters
 etc. (challenge to integrate in standard bi tools)
 Le 26 mars 2015 13:34, kundan kumar iitr.kun...@gmail.com a écrit :

 I was looking for some options and came across JethroData.

 http://www.jethrodata.com/

 This stores the data while maintaining indexes over all the columns; it seems
 good and claims to have better performance than Impala.

 Earlier I had tried Apache Phoenix because of its secondary indexing
 feature. But the major challenge I faced there was that secondary indexing was
 not supported for the bulk loading process. Only the sequential loading
 process supported secondary indexes, which took a longer time.


 Any comments on this ?




 On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar iitr.kun...@gmail.com
 wrote:

 I was looking for some options and came across

 http://www.jethrodata.com/

 On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke jornfra...@gmail.com
 wrote:

 You can also pre-aggregate results for the users' queries - depending on what
 queries they run, this might be necessary for any underlying technology.
 On 26 Mar 2015 at 11:27, kundan kumar iitr.kun...@gmail.com wrote:

 Hi,

 I need to store terabytes of data which will be used for BI tools like
 qlikview.

 The queries can be on the basis of filter on any column.

 Currently, we are using redshift for this purpose.

 I am trying to explore things other than Redshift.

 Is it possible to gain better performance in spark as compared to
 redshift ?

 If yes, please suggest what is the best way to achieve this.


 Thanks!!
 Kundan






Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your
Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html).
Please reference the Spark Properties section noting that you can modify
these properties via the spark-defaults.conf or via SparkConf().
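
For example (the 2g figure below is just an illustration - size it to what your driver can actually hold):

# conf/spark-defaults.conf
spark.driver.maxResultSize  2g

// or programmatically, before the SparkContext is created
val conf = new org.apache.spark.SparkConf()
  .setAppName("my-app")                        // app name is a placeholder
  .set("spark.driver.maxResultSize", "2g")
val sc = new org.apache.spark.SparkContext(conf)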

HTH!



On Wed, Mar 25, 2015 at 8:01 AM Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Hi



 I ran a spark job and got the following error. Can anybody tell me how to
 work around this problem? For example how can I increase
 spark.driver.maxResultSize? Thanks.

  org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID 6324, INT1-CAS01.pcc.lexisnexis.com): TaskKilled (killed intentionally)



 Ningjun





Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame
perspective:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E
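
If a quick workaround helps in the meantime, stddev can also be derived from avg aggregates in SQL - a rough sketch only (this assumes a DataFrame df with a numeric column x, computes the population stddev, and is numerically naive for widely spread values):

df.registerTempTable("t")
val stats = sqlContext.sql(
  "SELECT COUNT(x) AS cnt, AVG(x) AS mean, " +
  "SQRT(AVG(x * x) - AVG(x) * AVG(x)) AS stddev_pop FROM t")
stats.show()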


On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang hw...@qilinsoft.com wrote:

  Hi,



 I have a DataFrame object and I want to do types of aggregations like
 count, sum, variance, stddev, etc.



 DataFrame has DSL to do simple aggregations like count and sum.



 How about variance and stddev?



 Thank you for any suggestions!





Re: Errors in SPARK

2015-03-24 Thread Denny Lee
The error you're seeing typically means that you cannot connect to the Hive
metastore itself.  Some quick thoughts:
- If you were to run show tables (instead of the CREATE TABLE statement),
are you still getting the same error?

- To confirm, the Hive metastore (MySQL database) is up and running

- Did you download or build your version of Spark?
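
On the metastore point above, for reference, connecting to a MySQL-backed metastore usually means hive-site.xml carries settings along these lines (the host, port, and database name below are illustrative, not from your setup):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>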




On Tue, Mar 24, 2015 at 10:48 PM sandeep vura sandeepv...@gmail.com wrote:

 Hi Denny,

 Still facing the same issue. Please find the following errors.

 scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext: org.apache.spark.sql.hive.HiveContext =
 org.apache.spark.sql.hive.HiveContext@4e4f880c

 scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
 java.lang.RuntimeException: java.lang.RuntimeException: Unable to
 instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

 Cheers,
 Sandeep.v

 On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com
 wrote:

 No, I am just running the ./spark-shell command in a terminal. I will try with
 the above command.

 On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com
 wrote:

 Did you include the connection to a MySQL connector jar so that way
 spark-shell / hive can connect to the metastore?

 For example, when I run my spark-shell instance in standalone mode, I
 use:
 ./spark-shell --master spark://servername:7077 --driver-class-path /lib/
 mysql-connector-java-5.1.27.jar



 On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com
 wrote:

 Hi Sparkers,

 Can anyone please check the below error and give a solution for this. I am
 using hive version 0.13 and spark 1.2.1.

 Step 1 : I have installed hive 0.13 with local metastore (mySQL
 database)
 Step 2:  Hive is running without any errors and able to create tables
 and loading data in hive table
 Step 3: copied hive-site.xml in spark/conf directory
 Step 4: copied core-site.xml in spark/conf directory
 Step 5: started spark shell

 Please check the below error for clarifications.

 scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2821ec0c

 scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
 java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
 at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
 at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
 at scala.Option.orElse(Option.scala:257)
 at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scala:231)
 at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveMetastoreCatalog.<init>(HiveMetastoreCatalog.scala:55)

 Regards,
 Sandeep.v






Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way
spark-shell / hive can connect to the metastore?

For example, when I run my spark-shell instance in standalone mode, I use:
./spark-shell --master spark://servername:7077 --driver-class-path
/lib/mysql-connector-java-5.1.27.jar



On Fri, Mar 13, 2015 at 8:31 AM sandeep vura sandeepv...@gmail.com wrote:

 Hi Sparkers,

 Can anyone please check the below error and give a solution for this. I am
 using hive version 0.13 and spark 1.2.1.

 Step 1 : I have installed hive 0.13 with local metastore (mySQL database)
 Step 2:  Hive is running without any errors and able to create tables and
 loading data in hive table
 Step 3: copied hive-site.xml in spark/conf directory
 Step 4: copied core-site.xml in spark/conf directory
 Step 5: started spark shell

 Please check the below error for clarifications.

 scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2821ec0c

 scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
 java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
 at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
 at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
 at scala.Option.orElse(Option.scala:257)
 at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scala:231)
 at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
 at org.apache.spark.sql.hive.HiveMetastoreCatalog.<init>(HiveMetastoreCatalog.scala:55)

 Regards,
 Sandeep.v




Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar:
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
?
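
One knob worth checking when executors disappear only under YARN is the container memory overhead; YARN will kill executors that exceed their container limit. An illustrative invocation only - the 2048 MB figure is an example, not a recommendation:

./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 \
  --driver-memory 10g --num-executors 8 \
  --conf spark.yarn.executor.memoryOverhead=2048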



On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan 
harut.martiros...@gmail.com wrote:

 What is the performance overhead caused by YARN, or what configurations are
 being changed when the app is run through YARN?

 The following example:

 sqlContext.sql("""SELECT dayStamp(date),
 count(distinct deviceId) AS c
 FROM full
 GROUP BY dayStamp(date)
 ORDER BY c
 DESC LIMIT 10""")
 .collect()

 runs on shell when we use standalone scheduler:
 ./spark-shell --master sparkmaster:7077 --executor-memory 20g
 --executor-cores 10  --driver-memory 10g --num-executors 8

 and fails due to losing an executor, when we run it through YARN.
 ./spark-shell --master yarn-client --executor-memory 20g --executor-cores
 10  --driver-memory 10g --num-executors 8

 There are no evident logs, just messages that executors are being lost
 and connection refused errors (apparently due to executor failures).
 The cluster is the same, 8 nodes, 64Gb RAM each.
 Format is parquet.

 --
 RGRDZ Harut



Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5.x while using the
profile -Phadoop-2.4.

Please note earlier in the link the section:

# Apache Hadoop 2.4.X or 2.5.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

Versions of Hadoop after 2.5.X may or may not work with the
-Phadoop-2.4 profile (they were
released after this version of Spark).
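
For the CDH 5.3.2 case you mention, an illustrative build command (the exact Hadoop version string is an assumption - check your CDH release notes) would be:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package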


HTH!

On Tue, Mar 24, 2015 at 10:28 AM Manoj Samel manojsamelt...@gmail.com
wrote:


 http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn
 does not list hadoop 2.5 in Hadoop version table table etc.

 I assume it is still OK to compile with  -Pyarn -Phadoop-2.5 for use with
 Hadoop 2.5 (cdh 5.3.2)

 Thanks,



Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do
this: https://github.com/sigmoidanalytics/spork


On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote:

  Hi, all



 Can spark use pig’s load function to load data?



 Best Regards,

 Kevin.



Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I am currently doing what Marcelo is suggesting, as I have a CDH 5.2
cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in
my cluster.
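
For anyone following along, the side-by-side setup is roughly the following (the tarball name, class, and paths below are illustrative assumptions):

# download and unpack a newer Spark next to the cluster's install
wget http://archive.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
tar -xzf spark-1.3.0-bin-hadoop2.4.tgz

# submit with the 1.3.0 spark-submit against the existing YARN cluster
./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class your.app.Main \
  --master yarn-cluster --num-executors 100 \
  your-application.jar hdfs:///input_data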

On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote:

 Since you're using YARN, you should be able to download a Spark 1.3.0
 tarball from Spark's website and use spark-submit from that
 installation to launch your app against the YARN cluster.

 So effectively you would have 1.2.0 and 1.3.0 side-by-side in your cluster.

 On Wed, Mar 18, 2015 at 11:09 AM, jaykatukuri jkatuk...@apple.com wrote:
  Hi all,
  I am trying to run my job which needs spark-sql_2.11-1.3.0.jar.
  The cluster that I am running on is still on spark-1.2.0.
 
  I tried the following :
 
   spark-submit --class class-name --num-executors 100 --master yarn
   application_jar --jars hdfs:///path/spark-sql_2.11-1.3.0.jar
   hdfs:///input_data
  
   But this did not work; I get an error that it is not able to find a
   class/method that is in spark-sql_2.11-1.3.0.jar:
  
   org.apache.spark.sql.SQLContext.implicits()Lorg/apache/spark/sql/SQLContext$implicits$
 
  The question in general is how do we use a different version of spark
 jars
  (spark-core, spark-sql, spark-ml etc) than the one's running on a
 cluster ?
 
  Thanks,
  Jay
 
 
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Using-a-different-spark-jars-than-the-
 one-on-the-cluster-tp22125.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



 --
 Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting
Hive, it is in effect hitting HDFS as well.  Hive provides a great
framework where the table structure is already well defined. But
underneath it, Hive is just accessing files from HDFS, so you are hitting
HDFS either way.  HTH!
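
To make that concrete, a minimal sketch of the two paths (method names per the 1.2/1.3-era API; the table and path names are assumptions):

// via the Hive metastore - the table definition lives in Hive
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val viaHive = hiveContext.sql("SELECT * FROM my_table LIMIT 10")

// directly against files on HDFS - schema comes from the files themselves
val viaFiles = hiveContext.parquetFile("hdfs:///warehouse/my_table")
viaFiles.registerTempTable("my_table_files")
val viaHdfs = hiveContext.sql("SELECT * FROM my_table_files LIMIT 10")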

On Tue, Mar 17, 2015 at 3:41 AM 李铖 lidali...@gmail.com wrote:

 Hi,everybody.

 I am new in spark. Now I want to do interactive sql query using spark sql.
 spark sql can run under hive or loading files from hdfs.

 Which is better or faster?

 Thanks.



Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity?  Via YARN or
standalone mode?  When connecting Spark thriftserver to the Spark service,
have you allocated enough memory and CPU when executing with spark?
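
For example, resources can be passed straight through when the thrift server is started (the figures and master URL below are placeholders; the equivalent flags differ on YARN):

./sbin/start-thriftserver.sh --master spark://master:7077 \
  --executor-memory 8g --total-executor-cores 16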

On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote:

 We have cloudera CDH 5.3 installed on one machine.

 We are trying to use spark sql thrift server to execute some analysis
 queries against hive table.

 Without any changes in the configurations, we run the following query on
 both hive and spark sql thrift server

 *select * from tableName;*

 The time taken by Spark is larger than the time taken by Hive, which is not
 supposed to be the case.

 The hive table is mapped to json files stored on HDFS directory and we are
 using *org.openx.data.jsonserde.JsonSerDe* for
 serialization/deserialization.

 Why does Spark take much more time to execute the query than Hive?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-sql-thrift-server-slower-than-
 hive-tp22177.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

The code for Spark from the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to
val initialCount = this.count()

And liine 447 refers to
var samples = this.sample(withReplacement, fraction,
rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job
which is to generate the samples.

HTH!
Denny




On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica rvern...@gmail.com wrote:

 Hello,

 I am using takeSample from the Scala Spark 1.2.1 shell:

 scala> sc.textFile("README.md").takeSample(false, 3)


 and I notice that two jobs are generated on the Spark Jobs page:

 Job Id Description
 1 takeSample at console:13
 0  takeSample at console:13


 Any ideas why the two jobs are needed?

 Thanks!
 Rares



Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos
On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote:

 I'm sorry, but how do I look at the Mesos logs?
 Where are they?



 On 4 Mar 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote:


 You can check in the mesos logs and see whats really happening.

 Thanks
 Best Regards

 On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com wrote:

 15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not
 heard
 from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket
 connection and attempting reconnect
 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED
 15/03/04 09:26:36 INFO ZooKeeperLeaderElectionAgent: We have lost
 leadership
 15/03/04 09:26:36 ERROR Master: Leadership has been revoked -- master
 shutting down.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-master-shut-down-suddenly-tp21907.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel,

My team is currently working with a lot of SQL Server databases as one of
our many data sources and ultimately we pull the data into HDFS from SQL
Server.  As we had a lot of SQL databases to hit, we used the jTDS driver
and SQOOP to extract the data out of SQL Server and into HDFS (small hit
against the SQL databases to extract the data out).  The reasons we had
done this were to 1) minimize the impact on our SQL Servers since these
were transactional databases and we didn't want our analytics queries to
interfere with the transactions and 2) having the data within HDFS allowed
us to centralize our relational source data within one location so we could
join / mash it with other sources of data more easily.  Now that the data
is there, we just run our Spark queries against that and it's humming nicely.
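
As a rough illustration of that extraction step (every host, database, table, and path name below is made up - adjust to your environment):

# Sqoop import from SQL Server over the jTDS driver into HDFS
sqoop import \
  --driver net.sourceforge.jtds.jdbc.Driver \
  --connect "jdbc:jtds:sqlserver://sqlhost:1433;databaseName=Sales" \
  --username etl_user -P \
  --table Activity \
  --target-dir /data/raw/sales/activity \
  --num-mappers 8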

Saying this - I have not yet had a chance to try the Spark 1.3 JDBC data
sources.

Cheng, to confirm, the reference for JDBC is
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/java/org/apache/spark/sql/jdbc/package-tree.html
? In the past I have not been able to get SQL queries to run against SQL Server
without the use of the jTDS or Microsoft SQL Server JDBC driver for various
reasons (e.g. authentication, T-SQL vs. ANSI-SQL differences, etc.). If I
needed to utilize an additional driver like jTDS, can I plug it in with
the JDBC source and/or potentially build something that will work with the
Data Sources API?

Thanks!
Denny




On Tue Feb 24 2015 at 3:20:57 AM Cheng Lian lian.cs@gmail.com wrote:

  There is a newly introduced JDBC data source in Spark 1.3.0 (not the
 JdbcRDD in Spark core), which may be useful. However, currently there's no
 SQL server specific logics implemented. I'd assume standard SQL queries
 should work.


 Cheng


 On 2/24/15 7:02 PM, Suhel M wrote:

  Hey,

  I am trying to work out what is the best way we can leverage Spark for
 crunching data that is sitting in SQL Server databases.
 Ideal scenario is being able to efficiently work with big data (10 billion+
 rows of activity data).  We need to shape this data for machine learning
 problems and want to run ad-hoc and complex queries and get results in a
 timely manner.

 All our data crunching is done via SQL/MDX queries, but these obviously
 take a very long time to run over large data sizes. Also, we currently don't
 have Hadoop or any other distributed storage.

  Keen to hear feedback/thoughts/war stories from the Spark community on
 best way to approach this situation.

  Thanks
 Suhel





Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is:

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory or
unable to create one)

Could you verify that you (the user you are running under) has the rights
to create the necessary folders within HDFS?


On Tue, Feb 24, 2015 at 9:06 PM kundan kumar iitr.kun...@gmail.com wrote:

 Hi ,

 I have placed my hive-site.xml inside spark/conf and i am trying to
 execute some hive queries given in the documentation.

 Can you please suggest what wrong am I doing here.



 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 hiveContext: org.apache.spark.sql.hive.HiveContext =
 org.apache.spark.sql.hive.HiveContext@3340a4b8

 scala hiveContext.hql(CREATE TABLE IF NOT EXISTS src (key INT, value
 STRING))
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 15/02/25 10:30:59 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
 EXISTS src (key INT, value STRING)
 15/02/25 10:30:59 INFO ParseDriver: Parse Completed
 15/02/25 10:30:59 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/02/25 10:30:59 INFO ObjectStore: ObjectStore, initialize called
 15/02/25 10:30:59 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/02/25 10:30:59 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/02/25 10:31:08 INFO ObjectStore: Setting MetaStore object pin classes
 with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/02/25 10:31:08 INFO MetaStoreDirectSql: MySQL check failed, assuming we
 are not on mysql: Lexical error at line 1, column 5.  Encountered: @
 (64), after : .
 15/02/25 10:31:09 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:09 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:15 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:15 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:17 INFO ObjectStore: Initialized ObjectStore
 15/02/25 10:31:17 WARN ObjectStore: Version information not found in
 metastore. hive.metastore.schema.verification is not enabled so recording
 the schema version 0.13.1aa
 15/02/25 10:31:18 INFO HiveMetaStore: Added admin role in metastore
 15/02/25 10:31:18 INFO HiveMetaStore: Added public role in metastore
 15/02/25 10:31:18 INFO HiveMetaStore: No user is added in admin role,
 since config is empty
 15/02/25 10:31:18 INFO SessionState: No Tez session required at this
 point. hive.execution.engine=mr.
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=Driver.run
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=TimeToSubmit
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO Driver: Concurrency mode is disabled, not creating
 a lock manager
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=compile
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=parse
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
 EXISTS src (key INT, value STRING)
 15/02/25 10:31:18 INFO ParseDriver: Parse Completed
 15/02/25 10:31:18 INFO PerfLogger: /PERFLOG method=parse
 start=1424840478985 end=1424840478986 duration=1
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=semanticAnalyze
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:19 INFO SemanticAnalyzer: Starting Semantic Analysis
 15/02/25 10:31:19 INFO SemanticAnalyzer: Creating table src position=27
 15/02/25 10:31:19 INFO HiveMetaStore: 0: get_table : db=default tbl=src
 15/02/25 10:31:19 INFO audit: ugi=spuser ip=unknown-ip-addr cmd=get_table
 : db=default tbl=src
 15/02/25 10:31:19 INFO HiveMetaStore: 0: get_database: default
 15/02/25 10:31:19 INFO audit: ugi=spuser ip=unknown-ip-addr cmd=get_database:
 default
 15/02/25 10:31:19 INFO Driver: Semantic Analysis Completed
 15/02/25 10:31:19 INFO PerfLogger: /PERFLOG method=semanticAnalyze
 start=1424840478986 end=1424840479063 duration=77
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:19 INFO Driver: Returning Hive schema:
 Schema(fieldSchemas:null, properties:null)
 

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
That's all you should need to do. Saying this, I did run into an issue
similar to this when I was switching Spark versions which were tied to
different default Hive versions (eg Spark 1.3 by default works with Hive
0.13.1). I'm wondering if you may be hitting this issue due to that?
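
One other thing worth double-checking, since the earlier error showed a file:/user/hive/warehouse path: make sure the warehouse location actually resolves to HDFS (the namenode address below is illustrative):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://namenode:8020/user/hive/warehouse</value>
</property>
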
On Tue, Feb 24, 2015 at 22:40 kundan kumar iitr.kun...@gmail.com wrote:

 Hi Denny,

 yes the user has all the rights to HDFS. I am running all the spark
 operations with this user.

 and my hive-site.xml looks like this

  property
 namehive.metastore.warehouse.dir/name
 value/user/hive/warehouse/value
 descriptionlocation of default database for the
 warehouse/description
   /property

 Do I need to do anything explicitly other than placing hive-site.xml in
 the spark.conf directory ?

 Thanks !!



 On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee denny.g@gmail.com wrote:

 The error message you have is:

 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.DDLTask.
 MetaException(message:file:/user/hive/warehouse/src is not a directory
 or unable to create one)

 Could you verify that you (the user you are running under) has the rights
 to create the necessary folders within HDFS?


 On Tue, Feb 24, 2015 at 9:06 PM kundan kumar iitr.kun...@gmail.com
 wrote:

 Hi ,

 I have placed my hive-site.xml inside spark/conf and i am trying to
 execute some hive queries given in the documentation.

 Can you please suggest what wrong am I doing here.



 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 hiveContext: org.apache.spark.sql.hive.HiveContext =
 org.apache.spark.sql.hive.HiveContext@3340a4b8

 scala hiveContext.hql(CREATE TABLE IF NOT EXISTS src (key INT, value
 STRING))
 warning: there were 1 deprecation warning(s); re-run with -deprecation
 for details
 15/02/25 10:30:59 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
 EXISTS src (key INT, value STRING)
 15/02/25 10:30:59 INFO ParseDriver: Parse Completed
 15/02/25 10:30:59 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/02/25 10:30:59 INFO ObjectStore: ObjectStore, initialize called
 15/02/25 10:30:59 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/02/25 10:30:59 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/02/25 10:30:59 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/02/25 10:31:08 INFO ObjectStore: Setting MetaStore object pin classes
 with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/02/25 10:31:08 INFO MetaStoreDirectSql: MySQL check failed, assuming
 we are not on mysql: Lexical error at line 1, column 5.  Encountered: @
 (64), after : .
 15/02/25 10:31:09 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:09 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:15 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:15 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/02/25 10:31:17 INFO ObjectStore: Initialized ObjectStore
 15/02/25 10:31:17 WARN ObjectStore: Version information not found in
 metastore. hive.metastore.schema.verification is not enabled so recording
 the schema version 0.13.1aa
 15/02/25 10:31:18 INFO HiveMetaStore: Added admin role in metastore
 15/02/25 10:31:18 INFO HiveMetaStore: Added public role in metastore
 15/02/25 10:31:18 INFO HiveMetaStore: No user is added in admin role,
 since config is empty
 15/02/25 10:31:18 INFO SessionState: No Tez session required at this
 point. hive.execution.engine=mr.
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=Driver.run
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=TimeToSubmit
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO Driver: Concurrency mode is disabled, not
 creating a lock manager
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=compile
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO PerfLogger: PERFLOG method=parse
 from=org.apache.hadoop.hive.ql.Driver
 15/02/25 10:31:18 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT
 EXISTS src (key INT, value STRING)
 15/02/25 10:31:18 INFO ParseDriver: Parse Completed
 15/02/25 10:31:18 INFO PerfLogger: /PERFLOG method=parse
 start=1424840478985 end=1424840478986 duration=1
 from=org.apache.hadoop.hive.ql.Driver
 15

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 -
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ?
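
To dig into the AM side, you can also pull the aggregated YARN logs once the application has finished (substitute the application id shown in the ResourceManager UI):

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX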

On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote:

 Hi Sean,

 I launched the spark-shell on the same machine as I started YARN service.
 I don't think port will be an issue.

 I am new to spark. I checked the HDFS web UI and the YARN web UI. But I
 don't know how to check the AM. Can you help?


 Thanks,
 David


 On Tue, Feb 24, 2015 at 8:37 PM Sean Owen so...@cloudera.com wrote:

 I don't think the build is at issue. The error suggests your App Master
 can't be contacted. Is there a network port issue? did the AM fail?

 On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi Arush,

 I got the pre-build from https://spark.apache.org/downloads.html. When
 I start spark-shell, it prompts:

 Spark assembly has been built with Hive, including Datanucleus jars
 on classpath

 So we don't have a pre-built version with YARN support? If so, how does
 spark-submit work? I checked the YARN log, and the job is really submitted and
 ran successfully.


 Thanks,
 David





 On Tue Feb 24 2015 at 6:35:38 PM Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 Are you sure that you built Spark for YARN? If standalone works, I'm not
 sure it's built for YARN.

 Thanks
 Arush
 On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen davidshe...@gmail.com
 wrote:

 Hi,

 I followed this guide,
 http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
 start spark-shell with yarn-client

 ./bin/spark-shell --master yarn-client


 But I got

 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated 
 for [5000] ms. Reason is: [Disassociated].

 In the spark-shell, and other exceptions in they yarn log. Please see
 http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
 for more detail.


 However, submitting to the this cluster works. Also, spark-shell as
 standalone works.


 My system:

 - ubuntu amd64
 - spark 1.2.1
 - yarn from hadoop 2.6 stable


 Thanks,

 [image: --]
 Xi Shen
 [image: http://]about.me/davidshen
 http://about.me/davidshen?promo=email_sig
   http://about.me/davidshen


 --

 [image: Sigmoid Analytics]
 http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com





Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
Makes complete sense - I became a fan of Spark for pretty much the same
reasons.  Best of luck, eh?!

On Mon Feb 23 2015 at 12:08:49 AM Francisco Orchard forch...@gmail.com
wrote:

 Hi Denny & Ashic,

 You are pointing us in the right direction. Thanks!

 We will try following your advice and provide feedback to the list.

 Regarding your question, Denny: we feel MS is lacking a scalable
 solution for SSAS (tabular or multidim), so when it comes to big data the
 only answer they have is their expensive appliance (APS), which can be used
 as a ROLAP engine. We are interested in testing how Spark scales, to
 check if it can be offered as a less expensive alternative when a single
 machine is not enough for our client's needs. The reason why we do not go with
 tabular in the first place is that its ROLAP mode (direct query) is
 still too limited. And thanks for writing the Klout paper!! We were already
 using it as a guideline for our tests.

 Best regards,
 Francisco
 --
 From: Denny Lee denny.g@gmail.com
 Sent: ‎22/‎02/‎2015 17:56
 To: Ashic Mahtab as...@live.com; Francisco Orchard forch...@gmail.com;
 Apache Spark user@spark.apache.org
 Subject: Re: Spark SQL odbc on Windows

 Back to thrift, there was an earlier thread on this topic at
 http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
 that may be useful as well.

 On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote:

 Hi Francisco,

 Out of curiosity - why ROLAP mode using multi-dimensional mode (vs
 tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my
 interest.

 The one thing that you may run into is that the SQL generated by SSAS can
 be quite convoluted. When we were doing the same thing to try to get SSAS
 to connect to Hive (ref paper at
 http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
 that was definitely a blocker. Note that Spark SQL is different than HIVEQL
 but you may run into the same issue. If so, the trick you may want to use
 is similar to the paper - use a SQL Server linked server connection and
 have SQL Server be your translator for the SQL generated by SSAS.

 HTH!
 Denny

 On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:

 Hi Francisco,
 While I haven't tried this, have a look at the contents of
 start-thriftserver.sh - all it's doing is setting up a few variables and
 calling:

 /bin/spark-submit --class org.apache.spark.sql.hive.
 thriftserver.HiveThriftServer2

 and passing some additional parameters. Perhaps doing the same would
 work?

 I also believe that this hosts a jdbc server (not odbc), but there's a
 free odbc connector from databricks built by Simba, with which I've been
 able to connect to a spark cluster hosted on linux.

 -Ashic.

 --
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100


 Hello,
 I work on a MS consulting company and we are evaluating including SPARK
 on our BigData offer. We are particulary interested into testing SPARK as
 rolap engine for SSAS but we cannot find a way to activate the odbc server
 (thrift) on a Windows custer. There is no start-thriftserver.sh command
 available for windows.

 Somebody knows if there is a way to make this work?

 Thanks in advance!!
 Francisco




Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco,

Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular)
from SSAS to Spark? As a past SSAS guy you've definitely piqued my
interest.

The one thing that you may run into is that the SQL generated by SSAS can
be quite convoluted. When we were doing the same thing to try to get SSAS
to connect to Hive (ref paper at
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
that was definitely a blocker. Note that Spark SQL is different than HIVEQL
but you may run into the same issue. If so, the trick you may want to use
is similar to the paper - use a SQL Server linked server connection and
have SQL Server be your translator for the SQL generated by SSAS.

HTH!
Denny

On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:

 Hi Francisco,
 While I haven't tried this, have a look at the contents of
 start-thriftserver.sh - all it's doing is setting up a few variables and
 calling:

 /bin/spark-submit --class
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

 and passing some additional parameters. Perhaps doing the same would work?

 I also believe that this hosts a jdbc server (not odbc), but there's a
 free odbc connector from databricks built by Simba, with which I've been
 able to connect to a spark cluster hosted on linux.

 -Ashic.

 --
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100


 Hello,
 I work at an MS consulting company and we are evaluating including Spark
 in our big data offering. We are particularly interested in testing Spark as a
 ROLAP engine for SSAS, but we cannot find a way to activate the ODBC server
 (Thrift) on a Windows cluster. There is no start-thriftserver.sh command
 available for Windows.

 Does somebody know if there is a way to make this work?

 Thanks in advance!!
 Francisco



Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at
http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
that may be useful as well.
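
Since there is no start-thriftserver.sh on Windows, a rough equivalent of what that script does would be something along these lines (the master URL and memory are placeholders, and the exact arguments - including the trailing spark-internal placeholder resource - can vary by Spark version):

bin\spark-submit.cmd --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 ^
  --master spark://master:7077 --executor-memory 4g ^
  spark-internal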

On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote:

 Hi Francisco,

 Out of curiosity - why ROLAP mode using multi-dimensional mode (vs
 tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my
 interest.

 The one thing that you may run into is that the SQL generated by SSAS can
 be quite convoluted. When we were doing the same thing to try to get SSAS
 to connect to Hive (ref paper at
 http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
 that was definitely a blocker. Note that Spark SQL is different than HIVEQL
 but you may run into the same issue. If so, the trick you may want to use
 is similar to the paper - use a SQL Server linked server connection and
 have SQL Server be your translator for the SQL generated by SSAS.

 HTH!
 Denny

 On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:

 Hi Francisco,
 While I haven't tried this, have a look at the contents of
 start-thriftserver.sh - all it's doing is setting up a few variables and
 calling:

 /bin/spark-submit --class org.apache.spark.sql.hive.
 thriftserver.HiveThriftServer2

 and passing some additional parameters. Perhaps doing the same would work?

 I also believe that this hosts a jdbc server (not odbc), but there's a
 free odbc connector from databricks built by Simba, with which I've been
 able to connect to a spark cluster hosted on linux.

 -Ashic.

 --
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100


 Hello,
 I work at an MS consulting company and we are evaluating including Spark
 in our big data offering. We are particularly interested in testing Spark as a
 ROLAP engine for SSAS, but we cannot find a way to activate the ODBC server
 (Thrift) on a Windows cluster. There is no start-thriftserver.sh command
 available for Windows.

 Does somebody know if there is a way to make this work?

 Thanks in advance!!
 Francisco




Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Oh no worries at all. If you want, I'd be glad to make updates and PR for
anything I find, eh?!
On Fri, Feb 20, 2015 at 12:18 Michael Armbrust mich...@databricks.com
wrote:

 Yeah, sorry.  The programming guide has not been updated for 1.3.  I'm
 hoping to get to that this weekend / next week.

 On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote:

 Quickly reviewing the latest SQL Programming Guide
 https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
 (in github) I had a couple of quick questions:

 1) Do we need to instantiate the SparkContext as per
 // sc is an existing SparkContext.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 Within Spark 1.3 the sqlContext is already available so probably do not
 need to make this call.

 2) Importing org.apache.spark.sql._ should bring in both SQL data types,
 struct types, and row
 // Import Spark SQL data types and Row.
 import org.apache.spark.sql._

 Currently with Spark 1.3 RC1, it appears org.apache.spark.sql._ only
 brings in row.

 scala import org.apache.spark.sql._

 import org.apache.spark.sql._


 scala val schema =

  |   StructType(

  | schemaString.split( ).map(fieldName =
 StructField(fieldName, StringType, true)))

 console:25: error: not found: value StructType

  StructType(

 But if I also import in org.apache.spark.sql.types_

 scala import org.apache.spark.sql.types._

 import org.apache.spark.sql.types._


 scala val schema =

  |   StructType(

  | schemaString.split( ).map(fieldName =
 StructField(fieldName, StringType, true)))

 schema: org.apache.spark.sql.types.StructType =
 StructType(StructField(DeviceMake,StringType,true),
 StructField(Country,StringType,true))

 Wondering if this is by design or perhaps a quick documentation / package
 update is warranted.








Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
(in github) I had a couple of quick questions:

1) Do we need to instantiate the SparkContext as per
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Within Spark 1.3 the sqlContext is already available so probably do not
need to make this call.

2) Importing org.apache.spark.sql._ should bring in both SQL data types,
struct types, and row
// Import Spark SQL data types and Row.
import org.apache.spark.sql._

Currently with Spark 1.3 RC1, it appears org.apache.spark.sql._ only brings
in row.

scala import org.apache.spark.sql._

import org.apache.spark.sql._


scala val schema =

 |   StructType(

 | schemaString.split( ).map(fieldName = StructField(fieldName,
StringType, true)))

console:25: error: not found: value StructType

 StructType(

But if I also import in org.apache.spark.sql.types_

scala import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


scala val schema =

 |   StructType(

 | schemaString.split( ).map(fieldName = StructField(fieldName,
StringType, true)))

schema: org.apache.spark.sql.types.StructType =
StructType(StructField(DeviceMake,StringType,true),
StructField(Country,StringType,true))

Wondering if this is by design or perhaps a quick documentation / package
update is warranted.


Re: Tableau beta connector

2015-02-05 Thread Denny Lee
Could you clarify what you mean by build another Spark and work through
Spark Submit?

If you are referring to utilizing Spark and Thrift, you could start
the Spark service and then have your spark-shell, spark-submit, and/or
Thrift service aim at the master you have started.

On Thu Feb 05 2015 at 2:02:04 AM Ashutosh Trivedi (MT2013030) 
ashutosh.triv...@iiitb.org wrote:

  Hi Denny , Ismail one last question..


  Is it necessary to build another Spark and work through Spark-submit ?


  I work on IntelliJ using SBT as the build script. I have Hive set up with
  Postgres as the metastore, and I can run the Hive server using the commands

 *hive --service metastore*

 *hive --service hiveserver2*


  After that, I can use hive-context in my code

 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)


  do some processing on the RDD and persist it in Hive using registerTempTable,

 and Tableau can extract that RDD persisted in Hive.


  Regards,

 Ashutosh


  --
 *From:* Denny Lee denny.g@gmail.com

 *Sent:* Thursday, February 5, 2015 1:27 PM
 *To:* Ashutosh Trivedi (MT2013030); İsmail Keskin
 *Cc:* user@spark.apache.org
 *Subject:* Re: Tableau beta connector
 The context is that you would create your RDDs and then persist them in
 Hive. Once in Hive, the data is accessible from the Tableau extract through
 Spark thrift server.
 On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030) 
 ashutosh.triv...@iiitb.org wrote:

  Thanks Denny and Ismail.


  Denny ,I went through your blog, It was great help. I guess tableau
 beta connector also following the same procedure,you described in blog. I
 am building the Spark now.

 Basically what I don't get is, where to put my data so that tableau can
 extract.


  So  Ismail,its just Spark SQL. No RDDs I think I am getting it now . We
 use spark for our big data processing and we want *processed data (Rdd)*
 into tableau. So we should put our data in hive metastore and tableau will
 extract it from there using this connector? Correct me if I am wrong.


  I guess I have to look at how thrift server works.
  --
 *From:* Denny Lee denny.g@gmail.com
 *Sent:* Thursday, February 5, 2015 12:20 PM
 *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Tableau beta connector

 Some quick context behind how Tableau interacts with Spark / Hive
 can also be found at
 https://www.concur.com/blog/en-us/connect-tableau-to-sparksql  - its for
 how to connect from Tableau to the thrift server before the official
 Tableau beta connector but should provide some of the additional context
 called out.   HTH!

 On Wed Feb 04 2015 at 10:47:23 PM İsmail Keskin 
 ismail.kes...@dilisim.com wrote:

 Tableau connects to Spark Thrift Server via an ODBC driver. So, none of
 the RDD stuff applies, you just issue SQL queries from Tableau.

  The table metadata can come from Hive Metastore if you place your
 hive-site.xml to configuration directory of Spark.

 On Thu, Feb 5, 2015 at 8:11 AM, ashu ashutosh.triv...@iiitb.org wrote:

 Hi,
 I am trying out the tableau beta connector to Spark SQL. I have few
 basics
 question:
 Will this connector be able to fetch the schemaRDDs into tableau.
 Will all the schemaRDDs be exposed to tableau?
 Basically I am not getting what tableau will fetch at data-source? Is it
 existing files in HDFS? RDDs or something else.
 Question may be naive but I did not get answer anywhere else. Would
 really
 appreciate if someone has already tried it, can help me with this.

 Thanks,
 Ashutosh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Tableau-beta-connector-tp21512.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Tableau beta connector

2015-02-04 Thread Denny Lee
The context is that you would create your RDDs and then persist them in
Hive. Once in Hive, the data is accessible from the Tableau extract through
Spark thrift server.
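
A minimal sketch of that flow (Spark 1.2-era API; the case class, input path, and table name are all made up for illustration):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

case class Event(day: String, deviceId: String, amount: Double)
val processed = sc.textFile("hdfs:///data/events")
  .map(_.split(","))
  .map(e => Event(e(0), e(1), e(2).toDouble))

// persist into the Hive metastore so the Thrift server (and Tableau) can see it
processed.saveAsTable("processed_events")
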
On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030) 
ashutosh.triv...@iiitb.org wrote:

  Thanks Denny and Ismail.


  Denny ,I went through your blog, It was great help. I guess tableau beta
 connector also following the same procedure,you described in blog. I am
 building the Spark now.

 Basically what I don't get is, where to put my data so that tableau can
 extract.


  So  Ismail,its just Spark SQL. No RDDs I think I am getting it now . We
 use spark for our big data processing and we want *processed data (Rdd)*
 into tableau. So we should put our data in hive metastore and tableau will
 extract it from there using this connector? Correct me if I am wrong.


  I guess I have to look at how thrift server works.
  --
 *From:* Denny Lee denny.g@gmail.com
 *Sent:* Thursday, February 5, 2015 12:20 PM
 *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Tableau beta connector

  Some quick context behind how Tableau interacts with Spark / Hive can
 also be found at
 https://www.concur.com/blog/en-us/connect-tableau-to-sparksql  - its for
 how to connect from Tableau to the thrift server before the official
 Tableau beta connector but should provide some of the additional context
 called out.   HTH!

 On Wed Feb 04 2015 at 10:47:23 PM İsmail Keskin ismail.kes...@dilisim.com
 wrote:

 Tableau connects to Spark Thrift Server via an ODBC driver. So, none of
 the RDD stuff applies, you just issue SQL queries from Tableau.

  The table metadata can come from Hive Metastore if you place your
 hive-site.xml to configuration directory of Spark.

 On Thu, Feb 5, 2015 at 8:11 AM, ashu ashutosh.triv...@iiitb.org wrote:

 Hi,
 I am trying out the tableau beta connector to Spark SQL. I have few
 basics
 question:
 Will this connector be able to fetch the schemaRDDs into tableau.
 Will all the schemaRDDs be exposed to tableau?
 Basically I am not getting what tableau will fetch at data-source? Is it
 existing files in HDFS? RDDs or something else.
 Question may be naive but I did not get answer anywhere else. Would
 really
 appreciate if someone has already tried it, can help me with this.

 Thanks,
 Ashutosh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Tableau-beta-connector-tp21512.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun,

I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely
for development purposes).  I had most recently installed them utilizing
Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+.  A handy
thread concerning the null\bin\winutils issue is addressed in an earlier
thread at:
http://apache-spark-user-list.1001560.n3.nabble.com/Run-spark-unit-test-on-Windows-7-td8656.html
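
If it helps, the common workaround for the null\bin\winutils issue is to point Hadoop at a local winutils.exe (the C:\hadoop path below is just an example location):

REM place winutils.exe under C:\hadoop\bin, then before launching spark-shell:
set HADOOP_HOME=C:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%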

Hope this helps a little bit!
Denny





On Tue Feb 03 2015 at 8:24:24 AM Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Hi Gen



 Thanks for your feedback. We do have a business reason to run Spark on
 Windows. We have an existing application that is built on C# .NET running
 on Windows. We are considering adding Spark to the application for parallel
 processing of large data. We want Spark to run on Windows so it integrates
 with our existing app easily.



 Has anybody use spark on windows for production system? Is spark reliable
 on windows?



 Ningjun



 *From:* gen tang [mailto:gen.tan...@gmail.com]
 *Sent:* Thursday, January 29, 2015 12:53 PM


 *To:* Wang, Ningjun (LNG-NPV)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Fail to launch spark-shell on windows 2008 R2



 Hi,



 Using Spark under Windows is a really bad idea, because even if you solve the
 problems with Hadoop, you will probably run into
 java.net.SocketException: connection reset by peer. It is caused by the
 fact that we request socket ports too frequently under Windows. To my
 knowledge, it is really difficult to solve. And you will find something really
 funny: the same code sometimes works and sometimes not, even in shell mode.



 And I am sorry, but I don't see the point of running Spark under Windows, and
 moreover using the local file system, in a business environment. Do you have a
 cluster on Windows?



 FYI, I have used Spark prebuilt for Hadoop 1 under Windows 7 and there is
 no problem launching it, but I did have the java.net.SocketException problem.
 If you are using Spark prebuilt for Hadoop 2, you should consider following the
 solution provided by https://issues.apache.org/jira/browse/SPARK-2356



 Cheers

 Gen







 On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

 Install virtual box which run Linux? That does not help us. We have
 business reason to run it on Windows operating system, e.g. Windows 2008 R2.



 If anybody have done that, please give some advise on what version of
 spark, which version of Hadoop do you built spark against, etc…. Note that
 we only use local file system and do not have any hdfs file system at all.
 I don’t understand why spark generate so many error on Hadoop while we
 don’t even need hdfs.



 Ningjun





 *From:* gen tang [mailto:gen.tan...@gmail.com]
 *Sent:* Thursday, January 29, 2015 10:45 AM
 *To:* Wang, Ningjun (LNG-NPV)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Fail to launch spark-shell on windows 2008 R2



 Hi,



 I tried to use spark under windows once. However the only solution that I
 found is to install virtualbox



 Hope this can help you.

 Best

 Gen





 On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

 I deployed spark-1.1.0 on Windows 7 and was albe to launch the
 spark-shell. I then deploy it to windows 2008 R2 and launch the
 spark-shell, I got the error



 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOException: Cannot run program ls: CreateProcess
 error=2, The system cannot find the file specified

 at java.lang.ProcessBuilder.start(Unknown Source)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
 at org.apache.hadoop.util.Shell.run(Shell.java:182)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
 at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
 at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSystem.java:418)







 Here is the detail output





 C:\spark-1.1.0\bin   spark-shell

 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:14 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53692

 15/01/29 10:13:14 INFO Utils: Successfully 

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted
is at: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.
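
To make the "push aggregated data into Cassandra" step concrete, a rough sketch using the DataStax spark-cassandra-connector (the keyspace, table, and paths below are invented, and the CQL table is assumed to exist already):

import com.datastax.spark.connector._

val dailyAgg = sc.textFile("hdfs:///data/events")
  .map(_.split(","))
  .map(e => (e(0), 1L))
  .reduceByKey(_ + _)                    // (day, count)

// assumes CQL table olap.daily_counts(day text PRIMARY KEY, count bigint)
dailyAgg.saveToCassandra("olap", "daily_counts", SomeColumns("day", "count"))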

On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote:

 Write out the RDD to a Cassandra table.  The DataStax driver provides
 saveToCassandra() for this purpose.
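
 As a sketch of that call (the keyspace, table, and column names below are placeholders,
 and this assumes the spark-cassandra-connector package is on the classpath - it is not
 part of Spark itself):

   import com.datastax.spark.connector._

   // An already-aggregated RDD of (day, product, total) rows - placeholder data.
   val aggregated = sc.parallelize(Seq(("2015-02-03", "widgets", 42L)))

   // Persist it into a pre-created Cassandra table for the web tier to query.
   aggregated.saveToCassandra("olap", "daily_counts", SomeColumns("day", "product", "total"))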

 On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi,

 After some research I have decided that Spark (SQL) would be ideal for
 building an OLAP engine. My goal is to push aggregated data (to Cassandra
 or other low-latency data storage) and then be able to project the results
 on a web page (web service). New data will be added (aggregated) only once a
 day. On the other hand, the web service must be able to run some
 fixed(?) queries (either on Spark or Spark SQL) at any time and plot the
 results with D3.js. Note that I can already achieve similar speeds while in
 REPL mode by caching the data. Therefore, I believe that my problem should be
 re-phrased as follows: how can I automatically cache the data once a day
 and make it available to a web service that is capable of running any
 Spark or Spark SQL statement in order to plot the results with D3.js?

 Note that I already have some experience with Spark (+ Spark SQL) as well as
 D3.js, but none at all with OLAP engines (at least in their traditional form).

 Any ideas or suggestions?


 *// Adamantios*
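
 On the "cache once a day, query many times" part of the question, a rough sketch of
 what that could look like inside a long-running driver process (Spark 1.2-era API; the
 data source, table name, and query are placeholders, and the web-service layer itself
 is left out):

   import org.apache.spark.sql.SQLContext

   val sqlContext = new SQLContext(sc)

   // Load the daily aggregates (placeholder path/format) and cache them in memory.
   val daily = sqlContext.jsonFile("hdfs://namenode/aggregates/latest")
   daily.registerTempTable("daily_aggregates")
   sqlContext.cacheTable("daily_aggregates")

   // Fixed queries issued by the web tier then hit the in-memory cache.
   val totals = sqlContext.sql("SELECT product, SUM(total) FROM daily_aggregates GROUP BY product")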





Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
Cool!  For all the times I had been modifying the hive-site.xml I had only
propped in the integer values - learn something new every day, eh?!


On Sun Feb 01 2015 at 9:36:23 AM Ted Yu yuzhih...@gmail.com wrote:

 Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :


  METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay", "1s",
      new TimeValidator(TimeUnit.SECONDS),
      "Number of seconds for the client to wait between consecutive connection attempts"),

 It seems having the 's' suffix is legitimate.

 On Sun, Feb 1, 2015 at 9:14 AM, Denny Lee denny.g@gmail.com wrote:

 I may be missing something here, but typically the hive-site.xml
 configurations do not require you to place the 's' suffix within the configuration
 itself.  Both the retry.delay and socket.timeout values are in seconds, so
 you should only need to place the integer value.


 On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982 guxiaobo1...@qq.com
 wrote:

 Hi,

 In order to let a local spark-shell connect to a remote Spark
 stand-alone cluster and access Hive tables there, I must put the
 hive-site.xml file into the local Spark installation's conf path, but
 spark-shell can't even import the default settings there; I found two
 errors:

 <property>
   <name>hive.metastore.client.connect.retry.delay</name>
   <value>5s</value>
 </property>

 <property>
   <name>hive.metastore.client.socket.timeout</name>
   <value>1800s</value>
 </property>
 Spark-shell tries to read 5s and 1800s as integers; they must be changed
 to 5 and 1800 to let spark-shell work. I suggest this be fixed in future
 versions.
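
 As a workaround sketch in the meantime, the same two settings can be applied as plain
 integers from spark-shell itself, along the lines below (using HiveContext.setConf here
 is an illustration rather than something suggested in the thread; the values are the
 ones from the hive-site.xml above):

   import org.apache.spark.sql.hive.HiveContext

   val hiveCtx = new HiveContext(sc)   // sc is the SparkContext provided by spark-shell
   hiveCtx.setConf("hive.metastore.client.connect.retry.delay", "5")
   hiveCtx.setConf("hive.metastore.client.socket.timeout", "1800")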





Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place the 's' suffix within the configuration
itself.  Both the retry.delay and socket.timeout values are in seconds, so
you should only need to place the integer value.

On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982 guxiaobo1...@qq.com wrote:

 Hi,

 In order to let a local spark-shell connect to a remote Spark stand-alone
 cluster and access Hive tables there, I must put the hive-site.xml file
 into the local Spark installation's conf path, but spark-shell can't even
 import the default settings there; I found two errors:

 <property>
   <name>hive.metastore.client.connect.retry.delay</name>
   <value>5s</value>
 </property>

 <property>
   <name>hive.metastore.client.socket.timeout</name>
   <value>1800s</value>
 </property>
 Spark-shell tries to read 5s and 1800s as integers; they must be changed to
 5 and 1800 to let spark-shell work. I suggest this be fixed in future
 versions.



Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the
spark.executor.uri within spark-env.sh (and directly within bash as well),
the Mesos slaves do not seem to be able to access the spark tgz file via
HTTP or HDFS as per the message below.


14/12/30 15:57:35 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> 14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 0 is
now TASK_FAILED
14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now
TASK_FAILED
14/12/30 15:57:39 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now
TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now
TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos
slave value: 20141228-183059-3045950474-5050-2788-S1
 due to too many failures; is Spark installed on it?


I've verified that the Mesos slaves can access both the HTTP and HDFS
locations.  I'll start digging into the Mesos logs but was wondering if
anyone had run into this issue before.  I was able to get this to run
successfully on Spark 1.1 on GCP - my current environment that I'm
experimenting with is Digital Ocean - perhaps this is in play?

Thanks!
Denny
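
For reference, a rough sketch of the kind of configuration being described, set as a
Spark property instead of in spark-env.sh (the master URL and executor URI below are
placeholders, not the values used here):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("mesos-executor-uri-check")
    .setMaster("mesos://master-host:5050")  // placeholder Mesos master
    .set("spark.executor.uri", "hdfs://namenode/dist/spark-1.2.0-bin-hadoop2.4.tgz")  // placeholder
  val sc = new SparkContext(conf)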


Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the webUI or via spark-class.
More info can be found in the thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html.


HTH!

On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:

 Hi All ,

 It seems the problem is a little more complicated.

 If the job is hung up reading an S3 file, then even if I kill the unix process
 that started the job, the Spark job is not killed. It is still hung up there.

 Now the questions are:

 How do I find a Spark job based on its name?
 How do I kill a Spark job based on its name?

 Thanks for helping me.
 -D



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806p20842.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
To clarify, there isn't a Hadoop 2.6 profile per se but you can build using
-Dhadoop.version=2.4 which works with Hadoop 2.6.

On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote:

 You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0

 Cheers

 On Fri, Dec 19, 2014 at 12:51 PM, sa asuka.s...@gmail.com wrote:

 Can Spark be built with Hadoop 2.6? All the instructions I see only go up to
 2.4, and there does not seem to be a hadoop-2.6 profile. If it works with Hadoop
 2.6, can anyone recommend how to build it?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-2-6-compatibility-tp20790.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
Sorry Ted! I saw profile (-P) but missed the -D. My bad!
On Fri, Dec 19, 2014 at 16:46 Ted Yu yuzhih...@gmail.com wrote:

 Here is the command I used:

 mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -Phive -DskipTests

 FYI

 On Fri, Dec 19, 2014 at 4:35 PM, Denny Lee denny.g@gmail.com wrote:

 To clarify, there isn't a Hadoop 2.6 profile per se but you can build
 using -Dhadoop.version=2.4 which works with Hadoop 2.6.

 On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote:

 You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0

 Cheers

 On Fri, Dec 19, 2014 at 12:51 PM, sa asuka.s...@gmail.com wrote:

 Can Spark be built with Hadoop 2.6? All the instructions I see only go up
 to 2.4, and there does not seem to be a hadoop-2.6 profile. If it works with
 Hadoop 2.6, can anyone recommend how to build it?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-2-6-compatibility-tp20790.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue concerning the transfer rate of Spark
-> Hadoop -> GCS Connector -> GCS.

On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com
wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex
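
 A rough sketch of the kind of constructor being described (the class shape, bucket, and
 directory names are assumptions for illustration; the actual code is not in this thread).
 Note that sc.textFile() itself is lazy, so the "Total input paths to process" listing only
 appears once something touches rdd.partitions or runs an action:

   import org.apache.spark.SparkContext
   import org.apache.spark.rdd.RDD

   // One RDD per GCS directory; constructing these does not by itself list the bucket.
   class Dataset(sc: SparkContext, dirs: Seq[String]) {
     val rdds: Seq[RDD[String]] = dirs.map(dir => sc.textFile(s"gs://my-bucket/$dir"))
   }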



Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in just as fast
scans?

On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com
wrote:

 Denny,

 No, gsutil scans through the listing of the bucket quickly. See the
 following.

 alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

 6860

 real0m6.971s
 user0m1.052s
 sys 0m0.096s

 Alex


 On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex




Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
For Spark to connect to GCS, it utilizes the Hadoop and GCS connector jars
for connectivity. I'm wondering if it's those connection points that are
ultimately slowing down the connection between Spark and GCS.

The reason I was asking if you could run bdutil is because it would be
basically Hadoop connecting to GCS. If it's just as slow, then that would
point to the root cause. That is, it's the Hadoop connection that is
slowing things down rather than something in Spark per se.
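
As a sketch of how that Hadoop/GCS connector layer is typically wired in on the Spark
side (the property names are from the GCS connector as I recall them and the values are
placeholders, so treat this as illustrative rather than verified against this setup):

  // Point Hadoop's gs:// scheme at the GCS connector and read the same listing.
  sc.hadoopConfiguration.set("fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  sc.hadoopConfiguration.set("fs.gs.project.id", "my-gcp-project")  // placeholder project

  val objects = sc.textFile("gs://my-bucket/20141205/csv/*/*/*")    // mirrors the gsutil listing
  println(objects.partitions.length)  // forces the "Total input paths to process" listing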
On Wed, Dec 17, 2014 at 23:25 Alessandro Baretta alexbare...@gmail.com
wrote:

 Well, what do you suggest I run to test this? But more importantly, what
 information would this give me?

 On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:

 Oh, it makes sense that gsutil scans through this quickly, but I was
 wondering if running a Hadoop job / bdutil would result in just as fast
 scans?


 On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Denny,

 No, gsutil scans through the listing of the bucket quickly. See the
 following.

 alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

 6860

 real0m6.971s
 user0m1.052s
 sys 0m0.096s

 Alex


 On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com
 wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex




Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by
statement on, a la

val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))

I'm trying to do something similar to
tabs.map(c => (c._(167), c._(110), c._(200))

where I create a new RDD that only has those columns,
but that isn't quite right because I'm not really manipulating sequences.

BTW, I cannot use SparkSQL / case classes right now because my table has 200
columns (and I'm on Scala 2.10.3)

Thanks!
Denny


Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Getting a bunch of syntax errors. Let me get back with the full statement
and error later today. Thanks for verifying my thinking wasn't out in left
field.
On Sun, Dec 14, 2014 at 08:56 Gerard Maas gerard.m...@gmail.com wrote:

 Hi,

 I don't get what the problem is. That map to selected columns looks like
 the way to go given the context. What's not working?

 Kr, Gerard
 On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote:

 I have a large number of files within HDFS that I would like to do a group by
 statement on, a la

 val table = sc.textFile("hdfs://")
 val tabs = table.map(_.split("\t"))

 I'm trying to do something similar to
 tabs.map(c => (c._(167), c._(110), c._(200))

 where I create a new RDD that only has those columns,
 but that isn't quite right because I'm not really manipulating sequences.

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3)

 Thanks!
 Denny
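
 A minimal working sketch of the selection Gerard confirms, for anyone following along
 (the HDFS path and column indices are placeholders; note that a split line is an
 Array[String], so elements are selected with c(167) rather than c._(167)):

   // Placeholder path; split each line on tabs.
   val table = sc.textFile("hdfs://namenode:8020/path/to/files")
   val tabs = table.map(_.split("\t"))

   // Keep only the three columns of interest as a tuple-valued RDD.
   val selected = tabs.map(c => (c(167), c(110), c(200)))

   // e.g. group by the first of the kept columns:
   val grouped = selected.groupBy(_._1)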



