Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich for your reply.

I agree that it is not very scalable or efficient, but it handles Kafka
transactions correctly, and committing offsets to Kafka asynchronously has
not caused any problems so far.

Let me share some more details about my streaming job. CustomReceiver does
not receive anything from outside; it just forwards a notice message to
start an executor in which the Kafka consumer will run. See my
CustomReceiver below.

private static class CustomReceiver extends Receiver<String> {

    public CustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread(this::receive).start();
    }

    private void receive() {
        // Store a single dummy message just to trigger the foreachRDD task.
        String input = "receiver input " + UUID.randomUUID().toString();
        store(input);
    }

    @Override
    public void onStop() {
    }
}


Actually, just one Kafka consumer will run, consuming committed messages
directly from Kafka (which is not very scalable, I think). But the main
point of this approach, and what I need, is that the Spark session can be
used to save the RDD (the parallelized consumed messages) to an Iceberg
table: the consumed messages are converted to a Spark RDD, which is then
saved to the Iceberg table using the Spark session.
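A minimal sketch of that save step, assuming the consumed values are JSON
strings and that an Iceberg table (the name mycatalog.db.events below is only
a placeholder) already exists in a catalog configured on the Spark session,
could look like this:

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergAppendSketch {

    // Convert one polled batch of Kafka records to a DataFrame and append it
    // to an existing Iceberg table through the shared Spark session.
    static void appendBatch(SparkSession spark,
                            ConsumerRecords<String, String> records) throws Exception {
        List<String> values = new ArrayList<>();
        for (ConsumerRecord<String, String> record : records) {
            values.add(record.value());   // assumes the message value is a JSON string
        }
        if (values.isEmpty()) {
            return;                       // nothing to write for this poll
        }
        // Parallelize the consumed messages and parse them with the Spark session.
        Dataset<String> json = spark.createDataset(values, Encoders.STRING());
        Dataset<Row> df = spark.read().json(json);

        // Append to the already-created Iceberg table (placeholder name).
        df.writeTo("mycatalog.db.events").append();
    }
}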

I have tested this Spark streaming job with transactional producers sending
several million messages, and the messages were consumed and saved to the
Iceberg tables correctly.
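For context, a minimal sketch of such a transactional producer (the bootstrap
servers, topic name and transactional id below are placeholders, not the
actual test producer) looks like this:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");
        props.put("transactional.id", "tx-producer-1");                // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                for (int i = 0; i < 1000; i++) {
                    producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i));
                }
                // Only after this commit do the messages become visible to
                // read_committed consumers.
                producer.commitTransaction();
            } catch (RuntimeException e) {
                producer.abortTransaction();   // aborted messages are filtered out
                throw e;
            }
        }
    }
}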

- Kidong.



On Sun, 14 Apr 2024 at 23:05, Mich Talebzadeh wrote:

> Interesting
>
> My concern is the infinite loop in *foreachRDD*: the *while(true)* loop
> within foreachRDD creates an infinite loop within each Spark executor. This
> might not be the most efficient approach, especially since offsets are
> committed asynchronously.
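For comparison, a hedged sketch of a variant that avoids the executor-side
infinite loop by polling a bounded number of times per micro-batch (this is
only a suggestion, not code from the thread; the per-batch poll limit is
arbitrary) could look like this:

import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class BoundedPollSketch {

    // Poll a fixed number of times, process and commit, then close the consumer,
    // so each foreachRDD invocation terminates instead of looping forever.
    static void consumeOneBatch(Properties consumerProperties, String topic, long intervalMs) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProperties)) {
            consumer.subscribe(Arrays.asList(topic));
            int maxPolls = 10;   // arbitrary per-batch limit
            for (int i = 0; i < maxPolls; i++) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(intervalMs));
                Map<TopicPartition, OffsetAndMetadata> offsetMap = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // process the record here ...
                    offsetMap.put(new TopicPartition(record.topic(), record.partition()),
                                  new OffsetAndMetadata(record.offset() + 1));
                }
                if (!offsetMap.isEmpty()) {
                    consumer.commitSync(offsetMap);   // synchronous commit per poll
                }
            }
        }
    }
}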
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Sun, 14 Apr 2024 at 13:40, Kidong Lee  wrote:
>
>>
>> Because spark streaming for kafka transactions does not work correctly to
>> suit my need, I moved to another approach using raw kafka consumer which
>> handles read_committed messages from kafka correctly.
>>
>> My codes look like the following.
>>
>> JavaDStream<String> stream = ssc.receiverStream(new CustomReceiver());
>> // CustomReceiver does nothing special except waking up the foreach task.
>>
>> stream.foreachRDD(rdd -> {
>>
>>   KafkaConsumer<String, String> consumer =
>>       new KafkaConsumer<>(consumerProperties);
>>
>>   consumer.subscribe(Arrays.asList(topic));
>>
>>   while (true) {
>>
>>     ConsumerRecords<String, String> records =
>>         consumer.poll(java.time.Duration.ofMillis(intervalMs));
>>
>>     Map<TopicPartition, OffsetAndMetadata> offsetMap = new HashMap<>();
>>
>>     List<String> someList = new ArrayList<>();
>>
>>     for (ConsumerRecord<String, String> consumerRecord : records) {
>>
>>       // add something to the list.
>>
>>       // put the offset into offsetMap.
>>
>>     }
>>
>>     // process someList.
>>
>>     // commit offsets.
>>
>>     consumer.commitAsync(offsetMap, null);
>>
>>   }
>>
>> });
>>
>>
>> In addition, I increased max.poll.records to 10.
>>
>> Even if this raw kafka consumer approach is not so scalable, it consumes
>> read_committed messages from kafka correctly and is enough for me at the
>> moment.
>>
>> - Kidong.
>>
>>
>>
On Fri, 12 Apr 2024 at 21:19, Kidong Lee wrote:
>>
>>> Hi,
>>>
>>> I have a kafka producer which sends messages transactionally to kafka
>>> and spark streaming job which should consume read_committed messages from
>>> kafka.
>>> But there is a problem for spark streaming to consume read_committed
>>> messages.
>>> The count of messages sent by kafka producer transactionally is not the
>>> same as the count of the read_committed messages consumed by spark
>>> streaming.
>>>
>>> Some consumer properties of my spark streaming job are as follows.
>>>
>>> auto.offset.reset=earliest
>>> enable.auto.commit=false
>>> isolation.level=read_committed

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because Spark streaming for Kafka transactions does not work correctly for
my needs, I moved to another approach using a raw Kafka consumer, which
handles read_committed messages from Kafka correctly.

My code looks like the following.

JavaDStream<String> stream = ssc.receiverStream(new CustomReceiver());
// CustomReceiver does nothing special except waking up the foreach task.

stream.foreachRDD(rdd -> {

  KafkaConsumer<String, String> consumer =
      new KafkaConsumer<>(consumerProperties);

  consumer.subscribe(Arrays.asList(topic));

  while (true) {

    ConsumerRecords<String, String> records =
        consumer.poll(java.time.Duration.ofMillis(intervalMs));

    Map<TopicPartition, OffsetAndMetadata> offsetMap = new HashMap<>();

    List<String> someList = new ArrayList<>();

    for (ConsumerRecord<String, String> consumerRecord : records) {

      // add something to the list.

      // put the offset into offsetMap.

    }

    // process someList.

    // commit offsets.

    consumer.commitAsync(offsetMap, null);

  }

});
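The consumerProperties used above are not shown in the thread; a minimal
sketch of what they would typically contain for this read_committed use case
(the bootstrap servers and group id are placeholders) is:

import java.util.Properties;

import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerPropertiesSketch {

    static Properties consumerProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // placeholder
        props.put("group.id", "raw-consumer-group");          // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");
        props.put("enable.auto.commit", "false");              // offsets are committed manually
        props.put("isolation.level", "read_committed");        // only read committed messages
        return props;
    }
}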


In addition, I increased max.poll.records to 10.

Even though this raw Kafka consumer approach is not very scalable, it
consumes read_committed messages from Kafka correctly and is enough for me
at the moment.

- Kidong.



On Fri, 12 Apr 2024 at 21:19, Kidong Lee wrote:

> Hi,
>
> I have a kafka producer which sends messages transactionally to kafka and
> spark streaming job which should consume read_committed messages from kafka.
> But there is a problem for spark streaming to consume read_committed
> messages.
> The count of messages sent by kafka producer transactionally is not the
> same as the count of the read_committed messages consumed by spark
> streaming.
>
> Some consumer properties of my spark streaming job are as follows.
>
> auto.offset.reset=earliest
> enable.auto.commit=false
> isolation.level=read_committed
>
>
> I also added the following spark streaming configuration.
>
> sparkConf.set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true");
> sparkConf.set("spark.streaming.kafka.consumer.poll.ms", String.valueOf(2 * 60 
> * 1000));
>
>
> My spark streaming is using DirectStream like this.
>
> JavaInputDStream<ConsumerRecord<String, String>> stream =
> KafkaUtils.createDirectStream(
> ssc,
> LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(topics, 
> kafkaParams)
> );
>
>
> stream.foreachRDD(rdd -> {
>
>// get offset ranges.
>
>OffsetRange[] offsetRanges = ((HasOffsetRanges) 
> rdd.rdd()).offsetRanges();
>
>// process something.
>
>
>// commit offset.
>((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
>
> }
> );
>
>
>
> I have tested with a kafka consumer written with raw kafka-clients jar
> library without problem that it consumes read_committed messages correctly,
> and the count of consumed read_committed messages is equal to the count of
> messages sent by kafka producer.
>
>
> And sometimes, I got the following exception.
>
> Job aborted due to stage failure: Task 0 in stage 324.0 failed 1 times,
> most recent failure: Lost task 0.0 in stage 324.0 (TID 1674)
> (chango-private-1.chango.private executor driver):
> java.lang.IllegalArgumentException: requirement failed: Failed to get
> records for compacted spark-executor-school-student-group school-student-7
> after polling for 120000
>
> at scala.Predef$.require(Predef.scala:281)
>
> at
> org.apache.spark.streaming.kafka010.InternalKafkaConsumer.compactedNext(KafkaDataConsumer.scala:186)
>
> at
> org.apache.spark.streaming.kafka010.KafkaDataConsumer.compactedNext(KafkaDataConsumer.scala:60)
>
> at
> org.apache.spark.streaming.kafka010.KafkaDataConsumer.compactedNext$(KafkaDataConsumer.scala:59)
>
> at
> org.apache.spark.streaming.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.compactedNext(KafkaDataConsumer.scala:219)
>
>
>
> I have experienced spark streaming job which works fine with kafka
> messages which are non-transactional, and I never encountered the
> exceptions like above.
> It seems that spark streaming for kafka transaction does not handle such
> as kafka consumer properties like isolation.level=read_committed and
> enable.auto.commit=false correctly.
>
> Any help appreciated.
>
> - Kidong.
>
>
> --
> *이기동 *
> *Kidong Lee*
>
> Email: mykid...@gmail.com
> Chango: https://cloudcheflabs.github.io/chango-private-docs
> Web Site: http://www.cloudchef-labs.com/
> Mobile: +82 10 4981 7297
> <http://www.cloudchef-labs.com/>
>


-- 
*이기동 *
*Kidong Lee*

Email: mykid...@gmail.com
Chango: https://cloudcheflabs.github.io/chango-private-docs
Web Site: http://www.cloudchef-labs.com/
Mobile: +82 10 4981 7297
<http://www.cloudchef-labs.com/>


Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply.

Actually, I tried to follow most of your advice.

When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the
following error.

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
recent failure: Lost task 0.0 in stage 1.0 (TID 3)
(chango-private-1.chango.private executor driver):
java.lang.IllegalArgumentException: requirement failed: Got wrong record
for spark-executor-school-student-group school-student-7 even after seeking
to offset 11206961 got offset 11206962 instead. If this is a compacted
topic, consider enabling spark.streaming.kafka.allowNonConsecutiveOffsets

at scala.Predef$.require(Predef.scala:281)

at
org.apache.spark.streaming.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:155)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer.get(KafkaDataConsumer.scala:40)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer.get$(KafkaDataConsumer.scala:39)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:219)


And I tried to increase spark.streaming.kafka.consumer.poll.ms to avoid the
exceptions, but it did not help.


- Kidong.




On Sun, 14 Apr 2024 at 04:25, Mich Talebzadeh wrote:

> Hi Kidong,
>
> There may be a few potential reasons why the message counts from your Kafka
> producer and Spark Streaming consumer might not match, especially with
> transactional messages and read_committed isolation level.
>
> 1) Just ensure that both your Spark Streaming job and the Kafka consumer
> written with raw kafka-clients use the same consumer group. Messages are
> delivered to specific consumer groups, and if they differ, Spark Streaming
> might miss messages consumed by the raw consumer.
> 2) Your Spark Streaming configuration sets *enable.auto.commit=false* and
> uses *commitAsync manually*. However, I noted
> *spark.streaming.kafka.allowNonConsecutiveOffsets=true* which may be
> causing the problem. This setting allows Spark Streaming to read offsets
> that are not strictly increasing, which can happen with transactional
> reads. Generally recommended to set this to* false *for transactional
> reads to ensure Spark Streaming only reads committed messages.
> 3) Missed messages: with transactional messages, Kafka guarantees *delivery
> only after the transaction commits successfully*. There could be a slight
> delay between the producer sending the message and it becoming visible to
> consumers under read_committed isolation level. Spark Streaming could
> potentially miss messages during this window.
> 4) The exception Lost task 0.0 in stage 324.0, suggests a problem fetching
> records for a specific topic partition. Review your code handling of
> potential exceptions during rdd.foreachRDD processing. Ensure retries or
> appropriate error handling if encountering issues with specific partitions.
> 5) Try different configurations for *spark.streaming.kafka.consumer.poll.ms*
> to adjust polling
> frequency and potentially improve visibility into committed messages.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Fri, 12 Apr 2024 at 21:38, Kidong Lee  wrote:
>
>> Hi,
>>
>> I have a kafka producer which sends messages transactionally to kafka and
>> spark streaming job which should consume read_committed messages from kafka.
>> But there is a problem for spark streaming to consume read_committed
>> messages.
>> The count of messages sent by kafka producer transactionally is not the
>> same as the count of the read_committed messages consumed by spark
>> streaming.
>>
>> Some consumer properties of my spark streaming job are as follows.
>>
>> auto.offset.reset=earliest
>> enable.auto.commit=false
>> isolation.level=read_committed
>>
>>
>> I also added the following spark streaming configuration.
>>
>> sparkConf.set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true");
>> sparkConf.set("spark.streaming.kafka.consumer.poll.ms", String.valueOf(2 * 
>> 60 * 1000));
>>
>

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
Hi,

I have a Kafka producer which sends messages transactionally to Kafka, and a
Spark streaming job which should consume the read_committed messages from Kafka.
But there is a problem with Spark streaming consuming the read_committed
messages: the count of messages sent transactionally by the Kafka producer is
not the same as the count of the read_committed messages consumed by Spark
streaming.

Some consumer properties of my spark streaming job are as follows.

auto.offset.reset=earliest
enable.auto.commit=false
isolation.level=read_committed
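In Java these properties are typically passed to createDirectStream as the
kafkaParams map used below; a minimal sketch (the bootstrap servers, group id
and deserializer choices are assumptions, not shown in the original) would be:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaParamsSketch {

    static Map<String, Object> kafkaParams() {
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");   // placeholder
        kafkaParams.put("group.id", "spark-streaming-group");     // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);
        kafkaParams.put("isolation.level", "read_committed");
        return kafkaParams;
    }
}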


I also added the following spark streaming configuration.

sparkConf.set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true");
sparkConf.set("spark.streaming.kafka.consumer.poll.ms",
String.valueOf(2 * 60 * 1000));


My spark streaming is using DirectStream like this.

JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(
ssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topics, kafkaParams)
);


stream.foreachRDD(rdd -> {

   // get offset ranges.

   OffsetRange[] offsetRanges = ((HasOffsetRanges)
rdd.rdd()).offsetRanges();

   // process something.


   // commit offset.
   ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);

}
);



I have tested with a Kafka consumer written with the raw kafka-clients jar
library, without any problem: it consumes read_committed messages correctly,
and the count of consumed read_committed messages is equal to the count of
messages sent by the Kafka producer.


And sometimes, I got the following exception.

Job aborted due to stage failure: Task 0 in stage 324.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 324.0 (TID 1674)
(chango-private-1.chango.private executor driver):
java.lang.IllegalArgumentException: requirement failed: Failed to get
records for compacted spark-executor-school-student-group school-student-7
after polling for 120000

at scala.Predef$.require(Predef.scala:281)

at
org.apache.spark.streaming.kafka010.InternalKafkaConsumer.compactedNext(KafkaDataConsumer.scala:186)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer.compactedNext(KafkaDataConsumer.scala:60)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer.compactedNext$(KafkaDataConsumer.scala:59)

at
org.apache.spark.streaming.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.compactedNext(KafkaDataConsumer.scala:219)



I have experienced Spark streaming jobs which work fine with
non-transactional Kafka messages, and I never encountered exceptions like the
above.
It seems that Spark streaming for Kafka transactions does not correctly handle
Kafka consumer properties such as isolation.level=read_committed and
enable.auto.commit=false.

Any help appreciated.

- Kidong.


-- 
*이기동 *
*Kidong Lee*

Email: mykid...@gmail.com
Chango: https://cloudcheflabs.github.io/chango-private-docs
Web Site: http://www.cloudchef-labs.com/
Mobile: +82 10 4981 7297
<http://www.cloudchef-labs.com/>


Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-)

I would recommend checking out
https://issues.apache.org/jira/browse/SPARK-37935 as a starter task.

Refer to https://github.com/apache/spark/pull/41504,
https://github.com/apache/spark/pull/41455 as an example PR.

Or you can also add a new sub-task if you find any error messages that need
improvement.

Thanks!

On Mon, Sep 18, 2023 at 9:33 AM Denny Lee  wrote:

> Hi Ram,
>
> We have some good guidance at
> https://spark.apache.org/contributing.html
>
> HTH!
> Denny
>
>
> On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:
>
>>
>>
>>
>> Hello All,
>> I recently joined this community and would like to contribute. Is there a
>> guideline or recommendation on tasks that can be picked up by a first-timer,
>> or a starter task?
>>
>> Tried looking at stack overflow tag: apache-spark
>> <https://stackoverflow.com/questions/tagged/apache-spark>, couldn't find
>> any information for first time contributors.
>>
>> Looking forward to learning and contributing.
>>
>> Thanks
>> Ram
>>
>


Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram,

We have some good guidance at
https://spark.apache.org/contributing.html

HTH!
Denny


On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:

>
>
>
> Hello All,
> I recently joined this community and would like to contribute. Is there a
> guideline or recommendation on tasks that can be picked up by a first-timer,
> or a starter task?
>
> Tried looking at stack overflow tag: apache-spark
> , couldn't find
> any information for first time contributors.
>
> Looking forward to learning and contributing.
>
> Thanks
> Ram
>


Unsubscribe

2023-06-29 Thread lee
Unsubscribe 


李杰
leedd1...@163.com

Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
>>>>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>>>>> section of "Open Communications", restriction of communication inside ASF
>>>>>> INFRA (mailing list) is more about code and decision-making.
>>>>>>
>>>>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>>>>
>>>>>> It's unavoidable if "users" prefer to use an alternative
>>>>>> communication mechanism rather than the user mailing list. Before Stack
>>>>>> Overflow days, there had been a meaningful number of questions around 
>>>>>> user@.
>>>>>> It's just impossible to let them go back and post to the user mailing 
>>>>>> list.
>>>>>>
>>>>>> We just need to make sure it is not the purpose of employing Slack to
>>>>>> move all discussions about developments, direction of the project, etc
>>>>>> which must happen in dev@/private@. The purpose of Slack thread here
>>>>>> does not seem to aim to serve the purpose.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Good discussions and proposals all around.
>>>>>>>
>>>>>>> I have used slack in anger on a customer site before. For small and
>>>>>>> medium size groups it is good and affordable. Alternatives have been
>>>>>>> suggested as well so those who like investigative search can agree and 
>>>>>>> come
>>>>>>> up with a freebie one.
>>>>>>> I am inclined to agree with Bjorn that this slack has more social
>>>>>>> dimensions than the mailing list. It is akin to a sports club using
>>>>>>> WhatsApp groups for communication. Remember we were originally looking 
>>>>>>> for
>>>>>>> space for webinars, including Spark on LinkedIn that Denny Lee
>>>>>>> suggested.
>>>>>>> I think Slack and mailing groups can coexist happily. On a more serious
>>>>>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>>>>>> traffic. Currently we hardly get many mails daily, less than 5. So
>>>>>>> having a Slack-type medium may improve member participation.
>>>>>>>
>>>>>>> so +1 for me as well.
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>> Palantir Technologies Limited
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1.
>>>>>>>>
>>>>>>>> To Shani’s point, there are multiple OSS projects that use the free
>>>>>>>> Slack version - top of mind include Delta, Presto, Flink, Trino, 
>>>>>>>> Datahub,
>>>>>>>> MLflow, etc.
>>>>>>>>
>>>>>>>> On Thu, Mar 30, 2023 at 14:15  wrote:
>>>>>>

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
ng
>>>> list because we didn't set up any rule here yet.
>>>>
>>>> To Xiao. I understand what you mean. That's the reason why I added
>>>> Matei from your side.
>>>> > I did not see an objection from the ASF board.
>>>>
>>>> There is on-going discussion about the communication channels outside
>>>> ASF email which is specifically concerning Slack.
>>>> Please hold on any official action for this topic. We will know how to
>>>> support it seamlessly.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Mar 30, 2023 at 9:21 AM Xiao Li  wrote:
>>>>
>>>>> Hi, Dongjoon,
>>>>>
>>>>> The other communities (e.g., Pinot, Druid, Flink) created their own
>>>>> Slack workspaces last year. I did not see an objection from the ASF board.
>>>>> At the same time, Slack workspaces are very popular and useful in most
>>>>> non-ASF open source communities. TBH, we are kind of late. I think we can
>>>>> do the same in our community?
>>>>>
>>>>> We can follow the guide when the ASF has an official process for ASF
>>>>> archiving. Since our PMC are the owner of the slack workspace, we can make
>>>>> a change based on the policy. WDYT?
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> Dongjoon Hyun  于2023年3月30日周四 09:03写道:
>>>>>
>>>>>> Hi, Xiao and all.
>>>>>>
>>>>>> (cc Matei)
>>>>>>
>>>>>> Please hold on the vote.
>>>>>>
>>>>>> There is a concern expressed by ASF board because recent Slack
>>>>>> activities created an isolated silo outside of ASF mailing list archive.
>>>>>>
>>>>>> We need to establish a way to embrace it back to ASF archive before
>>>>>> starting anything official.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> + @d...@spark.apache.org 
>>>>>>>
>>>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>>>>>> Flink) have created their own dedicated Slack workspaces for faster
>>>>>>> communication. We can do the same in Apache Spark. The Slack workspace 
>>>>>>> will
>>>>>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for 
>>>>>>> the
>>>>>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>>>>>>
>>>>>>>> I created one at slack called pyspark
>>>>>>>>
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>>> Palantir Technologies Limited
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 good idea, I d like to join as well.
>>>>>>>>>
>>>>>>>>> On Tue, 28 Mar 2023 at 04:09, Winston Lai
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Please let us know when the channel is created. I'd like to join
>>>>>>>>>> :)
>>>>>>>>>>
>>>>>>>>>> Thank You & Best Regards
>>>>>>>>>> Winston Lai
>>>>>>>>>> --
>>>>>>>>>> *From:* Denny Lee 
>>>>>>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>>>>>>> *To:* Hyukjin Kwon 
>>>>>>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>>>>>>> user@spark.apache.org>
>>>>>>>>>> *Subject:* Re: Slack for PySpark users
>>>>>>>>>>
>>>>>>>>>> +1 I think this is a great idea!
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Yeah, actually I think we should better have a slack channel so
>>>>>>>>>> we can easily discuss with users and developers.
>>>>>>>>>>
>>>>>>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>> I really like *Slack *as communication channel for a tech
>>>>>>>>>> community.
>>>>>>>>>> There is a Slack workspace for *delta lake users* (
>>>>>>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>>>>>>> I was wondering if there is something similar for PySpark users.
>>>>>>>>>>
>>>>>>>>>> If not, would there be anything wrong with creating a new
>>>>>>>>>> Slack workspace for PySpark users? (when explicitly mentioning that 
>>>>>>>>>> this is
>>>>>>>>>> *not* officially part of Apache Spark)?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Asma ZGOLLI
>>>>>>>>>
>>>>>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: What is the range of the PageRank value of graphx

2023-03-28 Thread lee
That is, each PageRank value has no particular relationship to 1, right? As
long as we only compare the relative sizes of the PageRank values in GraphX,
we don't need to worry about their range, is that right?


李杰
leedd1...@163.com
Replied Message
From: Sean Owen
Date: 3/28/2023 22:33
To: lee
Cc: user@spark.apache.org
Subject: Re: What is the range of the PageRank value of graphx
From the docs:


 * Note that this is not the "normalized" PageRank and as a consequence pages 
that have no
 * inlinks will have a PageRank of alpha. In particular, the pageranks may have 
some values
 * greater than 1.
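Put differently, GraphX returns un-normalized ranks. If values comparable to
HugeGraph's are needed, a sketch of the normalization is simply

    r'_i = r_i / \sum_j r_j

which rescales the ranks so that they sum to 1 (and each normalized value is
below 1); the relative ordering of the vertices is unchanged.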



On Tue, Mar 28, 2023 at 9:11 AM lee  wrote:

When I calculate pagerank using HugeGraph, each pagerank value is less than 1, 
and the total of pageranks is 1. However, the PageRank value of graphx is often 
greater than 1, so what is the range of the PageRank value of graphx?




李杰
leedd1...@163.com

What is the range of the PageRank value of graphx

2023-03-28 Thread lee
When I calculate PageRank using HugeGraph, each PageRank value is less than 1,
and the PageRank values sum to 1. However, the PageRank values from GraphX are
often greater than 1, so what is the range of the PageRank values in GraphX?




李杰
leedd1...@163.com

Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea!

On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon  wrote:

> Yeah, actually I think we should better have a slack channel so we can
> easily discuss with users and developers.
>
> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>
>> Hi all,
>> I really like *Slack *as communication channel for a tech community.
>> There is a Slack workspace for *delta lake users* (
>> https://go.delta.io/slack) that I enjoy a lot.
>> I was wondering if there is something similar for PySpark users.
>>
>> If not, would there be anything wrong with creating a new Slack workspace
>> for PySpark users? (when explicitly mentioning that this is *not*
>> officially part of Apache Spark)?
>>
>> Cheers
>> Martin
>>
>


Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
What we can do is get into the habit of compiling the list on LinkedIn but
making sure this list is shared and broadcast here, eh?!

As well, when we broadcast the videos, we can do this using zoom/jitsi/
riverside.fm as well as simulcasting this on LinkedIn. This way you can
view directly on the former without ever logging in with a user ID.

HTH!!

On Wed, Mar 15, 2023 at 4:30 PM Mich Talebzadeh 
wrote:

> Understood, Nitin. It would be wrong to act against one's conviction. I am
> sure we can find a way around providing the contents.
>
> Regards
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Mar 2023 at 22:34, Nitin Bhansali 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for your prompt response ... much appreciated. I know how to and
>> can create login IDs on such sites, but I took a conscious decision some
>> 20 years ago (and I would be going against my principles) not to be on such
>> sites. Hence I had asked whether there is any other way I can join/view a
>> recording of the webinar.
>>
>> Anyways not to worry.
>>
>> Thanks & Regards
>>
>> Nitin.
>>
>>
>> On Wednesday, 15 March 2023 at 20:37:55 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi Nitin,
>>
>> Linkedin is more of a professional media.  FYI, I am only a member of
>> Linkedin, no facebook, etc.There is no reason for you NOT to create a
>> profile for yourself  in linkedin :)
>>
>>
>> https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en
>>
>> see you there as well.
>>
>> Best of luck.
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali 
>> wrote:
>>
>> Hello Mich,
>>
>> My apologies  ...  but I am not on any of such social/professional sites?
>> Any other way to access such webinars/classes?
>>
>> Thanks & Regards
>> Nitin.
>>
>> On Wednesday, 15 March 2023 at 18:26:51 GMT, Denny Lee <
>> denny.g@gmail.com> wrote:
>>
>>
>> Thanks Mich for tackling this!  I encourage everyone to add to the list
>> so we can have a comprehensive list of topics, eh?!
>>
>> On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
>> wrote:
>>
>> Hi all,
>>
>> Thanks to @Denny Lee   to give access to
>>
>> https://www.linkedin.com/company/apachespark/
>>
>> and contribution from @asma zgolli 
>>
>> You will see my post at the bottom. Please add anything else on topics to
>> the list as a comment.
>>
>> We will then put them together in an article perhaps. Comments and
>> contributions are welcome.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead,
>> Palantir Technologies Limited
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this!  I encourage everyone to add to the list so
we can have a comprehensive list of topics, eh?!

On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
wrote:

> Hi all,
>
> Thanks to @Denny Lee   to give access to
>
> https://www.linkedin.com/company/apachespark/
>
> and contribution from @asma zgolli 
>
> You will see my post at the bottom. Please add anything else on topics to
> the list as a comment.
>
> We will then put them together in an article perhaps. Comments and
> contributions are welcome.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
> wrote:
>
>> Hi Denny,
>>
>> That Apache Spark Linkedin page
>> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
>> a wider audience to benefit from it.
>>
>> +1 for me
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>>
>>> In the past, we've been using the Apache Spark LinkedIn page
>>> <https://www.linkedin.com/company/apachespark/> and group to broadcast
>>> these type of events - if you're cool with this?  Or we could go through
>>> the process of submitting and updating the current
>>> https://spark.apache.org or request to leverage the original Spark
>>> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>>>  WDYT?
>>>
>>> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well that needs to be created first for this purpose. The appropriate
>>>> name etc. to be decided. Maybe @Denny Lee   can
>>>> facilitate this as he offered his help.
>>>>
>>>>
>>>> cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>>>
>>>>> Hello Mich,
>>>>>
>>>>> Can you please provide the link for the confluence page?
>>>>>
>>>>> Many thanks
>>>>> Asma
>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>
>>>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Apologies I missed the list.
>>>>>>
>>>>>> To move forward I selected these topics from the thread "Online
>>>>>> classes for spark topics".
>>>>>>
>>>>>> To take this further I propose a confluence page to be set up.
>>>>>>
>>>>>>
>>>>>>1. Spark UI
>>>>>>2. Dynamic allocation

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page
<https://www.linkedin.com/company/apachespark/> and group to broadcast
these type of events - if you're cool with this?  Or we could go through
the process of submitting and updating the current https://spark.apache.org
or request to leverage the original Spark confluence page
<https://cwiki.apache.org/confluence/display/SPARK>. WDYT?

On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
wrote:

> Well that needs to be created first for this purpose. The appropriate name
> etc. to be decided. Maybe @Denny Lee   can
> facilitate this as he offered his help.
>
>
> cheers
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>
>> Hello Mich,
>>
>> Can you please provide the link for the confluence page?
>>
>> Many thanks
>> Asma
>> Ph.D. in Big Data - Applied Machine Learning
>>
>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh
>> wrote:
>>
>>> Apologies I missed the list.
>>>
>>> To move forward I selected these topics from the thread "Online classes
>>> for spark topics".
>>>
>>> To take this further I propose a confluence page to be set up.
>>>
>>>
>>>1. Spark UI
>>>2. Dynamic allocation
>>>3. Tuning of jobs
>>>4. Collecting spark metrics for monitoring and alerting
>>>5.  For those who prefer to use Pandas API on Spark since the
>>>release of Spark 3.2, What are some important notes for those users? For
>>>example, what are the additional factors affecting the Spark performance
>>>using Pandas API on Spark? How to tune them in addition to the 
>>> conventional
>>>Spark tuning methods applied to Spark SQL users.
>>>6. Spark internals and/or comparing spark 3 and 2
>>>7. Spark Streaming & Spark Structured Streaming
>>>8. Spark on notebooks
>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>10. Spark on k8s
>>>
>>> Opinions and how to is welcome
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> To move forward I selected these topics from the thread "Online classes
>>>> for spark topics".
>>>>
>>>> To take this further I propose a confluence page to be set up.
>>>>
>>>> Opinions and how to is welcome
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>
>>
>>


Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up
the infrastructure to broadcast if it helps?

On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani 
wrote:

> I am happy to be a part of this discussion as well.
>
> Regards,
> Neeraj
>
> On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:
>
>> +1, any webinar on Spark related topic is appreciated 
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> On Wed, 8 Mar 2023 at 21:32, karan alang
>> wrote:
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I for myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regard
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might  be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in so to speak.
>>
>> I am sure there are many but on top of my head I can think of Holden
>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arranges online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts can share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences told valuable.
>>
>> Respectively,
>>
>> AK
>>
>>
>>
>>
>


Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group
 but
honestly the turnout was pretty low.  We had dived into various features.
If there are particular topics that you would like to discuss during a
live session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:

> +1
>
> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai  wrote:
>
>> +1, any webinar on Spark related topic is appreciated 
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* asma zgolli 
>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>> *To:* karan alang 
>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
>> ashok34...@yahoo.com>; User 
>> *Subject:* Re: Online classes for spark topics
>>
>> +1
>>
>> On Wed, 8 Mar 2023 at 21:32, karan alang
>> wrote:
>>
>> +1 .. I'm happy to be part of these discussions as well !
>>
>>
>>
>>
>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I guess I can schedule this work over a course of time. I for myself can
>> contribute plus learn from others.
>>
>> So +1 for me.
>>
>> Let us see if anyone else is interested.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>> wrote:
>>
>>
>> Hello Mich.
>>
>> Greetings. Would you be able to arrange a Spark Structured Streaming
>> learning webinar?
>>
>> This is something I have been struggling with recently. It will be very
>> helpful.
>>
>> Thanks and Regard
>>
>> AK
>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> This might  be a worthwhile exercise on the assumption that the
>> contributors will find the time and bandwidth to chip in so to speak.
>>
>> I am sure there are many but on top of my head I can think of Holden
>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>> experienced.
>>
>> Anyone else?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>  wrote:
>>
>> Hello gurus,
>>
>> Does Spark arranges online webinars for special topics like Spark on K8s,
>> data science and Spark Structured Streaming?
>>
>> I would be most grateful if experts can share their experience with
>> learners with intermediate knowledge like myself. Hopefully we will find
>> the practical experiences told valuable.
>>
>> Respectively,
>>
>> AK
>>
>>
>>
>>
>


Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja,

A little atypical way to respond to your question - please check out the
most recent Spark AMA where we discuss this:
https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share_medium=member_ios

HTH!
Denny



On Tue, Oct 25, 2022 at 09:16 Raja bhupati 
wrote:

> We have a use case where we would like to process Prometheus metrics data
> with spark.
>
> On Tue, Oct 25, 2022, 19:49 Jacek Laskowski  wrote:
>
>> Hi Raj,
>>
>> Do you want to do the following?
>>
>> spark.read.format("prometheus").load...
>>
>> I haven't heard of such a data source / format before.
>>
>> What would you like it for?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, Oct 21, 2022 at 6:12 PM Raj ks  wrote:
>>
>>> Hi Team,
>>>
>>>
>>> We wanted to query Prometheus data with spark. Any suggestions will
>>> be appreciated.
>>>
>>> I searched for documentation but did not find anything definitive.
>>>
>>
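As noted above, there is no built-in "prometheus" data source in Spark. One
hedged approach, sketched below, is to call the Prometheus HTTP API directly
and load the JSON response into a DataFrame; the server address and the PromQL
query ("up") are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PrometheusToSparkSketch {

    public static void main(String[] args) throws Exception {
        // Query the Prometheus HTTP API (placeholder address and PromQL query).
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9090/api/v1/query?query=up")).build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Load the JSON response into a DataFrame for further processing.
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("prometheus-sketch")
                .getOrCreate();
        Dataset<String> json =
                spark.createDataset(Collections.singletonList(body), Encoders.STRING());
        Dataset<Row> df = spark.read().json(json);
        df.printSchema();
        df.show(false);

        spark.stop();
    }
}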


Re: spark thrift server as hive on spark running on kubernetes, and more.

2021-12-14 Thread Kidong Lee
Hi all,

Recently I have written a spark operator to deploy spark applications onto
Kubernetes using custom resources. See DataRoaster spark operator for more
details:
https://github.com/cloudcheflabs/dataroaster/tree/master/operators/spark

The Spark thrift server can be deployed more easily on Kubernetes in cluster
mode using the DataRoaster spark operator. Please see my blog on how to do so:
https://t.co/T3SXG0mZFB

Cheers,

- Kidong Lee




On Fri, 10 Sep 2021 at 08:38, Kidong Lee wrote:

> Hi,
>
> Recently, I have open-sourced a tool called DataRoaster(
> https://github.com/cloudcheflabs/dataroaster) to provide data platforms
> running on kubernetes with ease.
> In particular, with DataRoaster, you can deploy spark thrift server on
> kubernetes easily, which is originated from my blog of
> https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1.
> In addition to spark thrift server as hive on spark, there are several
> components provided by DataRoaster, for instance, hive metastore, trino,
> redash, jupyterhub, kafka.
>
> To use DataRoaster,
> - visit https://github.com/cloudcheflabs/dataroaster .
> - see also demo
> https://github.com/cloudcheflabs/dataroaster#dataroaster-demo
>
> Thank you.
>
> - Kidong Lee.
>
-- 
*이기동 *
*Kidong Lee*

Web Site: http://www.cloudchef-labs.com/
Email: mykid...@gmail.com
Mobile: +82 10 4981 7297
<http://www.cloudchef-labs.com/>


spark thrift server as hive on spark running on kubernetes, and more.

2021-09-09 Thread Kidong Lee
Hi,

Recently, I have open-sourced a tool called DataRoaster(
https://github.com/cloudcheflabs/dataroaster) to provide data platforms
running on kubernetes with ease.
In particular, with DataRoaster, you can easily deploy the spark thrift server
on kubernetes, which originated from my blog post
https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1.
In addition to spark thrift server as hive on spark, there are several
components provided by DataRoaster, for instance, hive metastore, trino,
redash, jupyterhub, kafka.

To use DataRoaster,
- visit https://github.com/cloudcheflabs/dataroaster .
- see also demo
https://github.com/cloudcheflabs/dataroaster#dataroaster-demo

Thank you.

- Kidong Lee.


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan,

You may want to ping Databricks Help  or
Forums  as this is a Databricks
specific question.  I'm a little surprised that a Databricks cluster would
take a long time to create so it may be best to utilize these forums to
grok the cause.

HTH!
Denny


Sent via Superhuman 


On Mon, Aug 16, 2021 at 11:10 PM, karan alang  wrote:

> Hello - I've been using the Databricks notebook (for pyspark or scala/spark
> development), and recently have had issues where the cluster takes a long
> time to get created, often timing out.
>
> Any ideas on how to resolve this ?
> Any other alternatives to databricks notebook ?
>


Re: Append to an existing Delta Lake using structured streaming

2021-07-21 Thread Denny Lee
Including the Delta Lake Users and Developers DL to help out.

That said, could you clarify how the data is not being added?  By any chance
do you have any code samples to recreate this?

Sent via Superhuman 


On Wed, Jul 21, 2021 at 2:49 AM,  wrote:

> Hi all,
>   I stumbled upon an interesting problem. I have an existing Delta lake
> with data recovered from a backup and would like to append to this
> Delta lake using Spark structured streaming. This does not work: although
> the streaming job is running, no data is appended.
> If I created the original Delta lake with structured streaming, then appending
> to it with a streaming job (at least with the same job) works
> flawlessly.  Did I misunderstand something here?
>
> best regards
>Eugen Wintersberger
>
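For reference, a minimal append-mode sketch of writing a stream into an
existing Delta table path (assuming the Delta Lake libraries are on the
classpath; the rate source, table path and checkpoint location are
placeholders, and the source schema is assumed to match the existing table;
each streaming query also needs its own checkpointLocation) looks like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class DeltaAppendSketch {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("delta-append-sketch")
                .getOrCreate();

        // Placeholder source: a rate stream instead of the real upstream data.
        Dataset<Row> source = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "10")
                .load();

        // Append to the existing Delta table path with a fresh checkpoint location.
        StreamingQuery query = source.writeStream()
                .format("delta")
                .outputMode("append")
                .option("checkpointLocation", "/tmp/checkpoints/delta-append-sketch")
                .start("/data/delta/existing-table");   // placeholder table path

        query.awaitTermination();
    }
}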


Re: How to unsubscribe

2020-05-06 Thread Denny Lee
Hi Fred,

To unsubscribe, could you please email: user-unsubscr...@spark.apache.org
(for more information, please refer to
https://spark.apache.org/community.html).

Thanks!
Denny


On Wed, May 6, 2020 at 10:12 AM Fred Liu  wrote:

> Hi guys
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
> *From:* Fred Liu 
> *Sent:* Wednesday, May 6, 2020 10:10 AM
> *To:* user@spark.apache.org
> *Subject:* Unsubscribe
>
>
>
> *[External E-mail]*
>
> *CAUTION: This email originated from outside the organization. Do not
> click links or open attachments unless you recognize the sender and know
> the content is safe.*
>
>
>
>
>


Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Denny Lee
There are a number of really good datasets already available including (but
not limited to):
- South Korea COVID-19 Dataset

- 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns
Hopkins CSSE 
- COVID-19 Open Research Dataset Challenge (CORD-19)


BTW, I had co-presented in a recent tech talk on Analyzing COVID-19: Can
the Data Community Help? 

In the US, there is a good resource Coronavirus in the United States:
Mapping the COVID-19 outbreak
 and
there are various global starter projects on Reddit's r/CovidProjects
.

There are a lot of good projects that we can all help individually or
together.  I would suggest to see what hospitals/academic institutions that
are doing analysis in your local region.  Even if you're analyzing public
worldwide data,  how it acts in your local region will often be different.







On Thu, Mar 26, 2020 at 12:30 PM Rajev Agarwal 
wrote:

> Actually I thought these sites exist look at John's hopkins and
> worldometers
>
> On Thu, Mar 26, 2020, 2:27 PM Zahid Rahman  wrote:
>
>>
>> "We can then donate this to WHO or others and we can make it very modular
>> though microservices etc."
>>
>> I have no interest because there are 8 million muslims locked up in their
>> home for 8 months by the Hindutwa (Indians)
>> You didn't take any notice of them.
>> Now you are locked up in your home and you want to contribute to the WHO.
>> The same WHO and you who didn't take any notice of the 8 million Kashmiri
>> Muslims.
>> The daily rapes of women and the imprisonment and torture of  men.
>>
>> Indian is the most dangerous country for women.
>>
>>
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>> On Thu, 26 Mar 2020 at 14:53, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks, but nobody claimed we can fix it. However, we can all contribute
>>> to it. When it utilizes the cloud, it becomes a global digitization
>>> issue.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 26 Mar 2020 at 14:43, Laurent Bastien Corbeil <
>>> bastiencorb...@gmail.com> wrote:
>>>
 People in tech should be more humble and admit this is not something
 they can fix. There's already plenty of visualizations, dashboards etc
 showing the spread of the virus. This is not even a big data problem, so
 Spark would have limited use.

 On Thu, Mar 26, 2020 at 10:37 AM Sol Rodriguez 
 wrote:

> IMO it's not about technology, it's about data... if we don't have
> access to the data there's no point throwing "microservices" and "kafka" 
> at
> the problem. You might find that the most effective analysis might be
> delivered through an excel sheet ;)
> So before technology I'd suggest to get access to sources and then
> figure out how to best exploit them and deliver the information to the
> right people
>
> On Thu, Mar 26, 2020 at 2:29 PM Chenguang He 
> wrote:
>
>> Have you taken a look at this (
>> https://coronavirus.1point3acres.com/en/test  )?
>>
>> They have a visualizer with a very basic analysis of the outbreak.
>>
>> On Thu, Mar 26, 2020 at 8:54 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks.
>>>
>>> Agreed, computers are not the end but means to an end. We all have
>>> to start from somewhere. It all helps.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>> for any loss, damage or destruction of data or any other property which 
>>> may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author 

unsubscribe

2019-12-30 Thread Jaebin Lee
unsubscribe


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1

On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:

> +1
>
> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>
>> +1 and the draft sounds good
>>
>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>
>>> Here is the draft announcement:
>>>
>>> ===
>>> Plan for dropping Python 2 support
>>>
>>> As many of you already know, the Python core development team and many
>>> widely used Python packages like Pandas and NumPy will drop Python 2 support
>>> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>>> since the Spark 1.4 release in 2015. However, maintaining Python 2/3
>>> compatibility is an increasing burden and it essentially limits the use of
>>> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
>>> coming, we plan to eventually drop Python 2 support as well. The current
>>> plan is as follows:
>>>
>>> * In the next major release in 2019, we will deprecate Python 2 support.
>>> PySpark users will see a deprecation warning if Python 2 is used. We will
>>> publish a migration guide for PySpark users to migrate to Python 3.
>>> * We will drop Python 2 support in a future release in 2020, after
>>> Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
>>> used.
>>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>>> might not take patches that are specific to Python 2.
>>> ===
>>>
>>> Sean helped make a pass. If it looks good, I'm going to upload it to
>>> Spark website and announce it here. Let me know if you think we should do a
>>> VOTE instead.
>>>
>>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>>
 I created https://issues.apache.org/jira/browse/SPARK-27884 to track
 the work.

 On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
 wrote:

> We don’t usually reference a future release on website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when is Spark 3.0 coming out then.
> Might need to provide some clarity on that.
>

 We can say the "next major release in 2019" instead of Spark 3.0. Spark
 3.0 timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp 
> wrote:
>
>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to
>> continue support for python2.7 for spark 2.x as the feature sets won't be
>> expanding.
>>
>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
 On the development side, in a future release that drops Python 2
 support we can remove code that maintains python 2/3 compatibility and
 start using python 3 only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


unsubscribe

2019-03-28 Thread Byron Lee
unsubscribe


unsubscribe

2019-03-11 Thread Byron Lee



Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Sorry, the code is too long, so to keep it simple please look at the photo.

I define an ArrayBuffer containing "1 2", "2 3", and "4 5", and I want to
save it to HDFS, so I turn it into an RDD with
sc.parallelize(arrayBuffer).
When I run it in IDEA the println(_) output is right, but when it runs
distributed there is nothing in HDFS.
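
A minimal sketch of the kind of job described above, assuming the goal is to
write the parallelized ArrayBuffer to HDFS with saveAsTextFile (the output
path is an illustrative placeholder):

import scala.collection.mutable.ArrayBuffer

val buffer = ArrayBuffer("1 2", "2 3", "4 5")
val rdd = sc.parallelize(buffer)

// Note: println inside an RDD operation runs on the executors, so on a cluster
// the output goes to the executors' stdout, not to the driver console.
rdd.foreach(println)

// Writing to HDFS creates one part-* file per partition under the target directory.
rdd.saveAsTextFile("hdfs:///user/test/rdd-output")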



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Hi all,
In my experiment program I used Spark GraphX.
When running in IDEA on Windows the result is right,
but when running on the Linux distributed cluster the result in HDFS is
empty.
Why, and how can I solve this?

 

Thanks!
Jian Li



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Job hangs in blocked task in final parquet write stage

2018-12-11 Thread Conrad Lee
So based on many more runs of this job I've come to the conclusion that a
workaround to this error is to

   - decrease the amount of data written in each partition, or
   - increase the amount of memory available to each executor

I still don't know what the root cause of the issue is.
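
A rough sketch of those two workarounds, assuming a plain sort-then-write job
like the one described in this thread, a SparkSession named spark, and a
DataFrame df; the column name, path, and numbers are illustrative only:

// Workaround 1: let the sort produce more, smaller partitions so each write
// task holds less data (the write stage runs on the sorted shuffle output).
spark.conf.set("spark.sql.shuffle.partitions", "192")   // e.g. 192 instead of 96
df.sort("sort_key").write.parquet("hdfs:///path/to/output")

// Workaround 2: give each executor more memory when submitting the job, e.g.
//   spark-submit --executor-memory 8g ...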

On Tue, Dec 4, 2018 at 9:45 AM Conrad Lee  wrote:

> Yeah, probably increasing the memory or increasing the number of output
> partitions would help.  However increasing memory available to each
> executor would add expense.  I want to keep the number of partitions low so
> that each parquet file turns out to be around 128 mb, which is best
> practice for long-term storage and use with other systems like presto.
>
> This feels like a bug due to the flakey nature of the failure -- also,
> usually when the memory gets too low the executor is killed or errors out
> and I get one of the typical Spark OOM error codes.  When I run the same
> job with the same resources sometimes this job succeeds, and sometimes it
> fails.
>
> On Mon, Dec 3, 2018 at 5:19 PM Christopher Petrino <
> christopher.petr...@gmail.com> wrote:
>
>> Depending on the size of your data set and how many resources you
>> have (num-executors, executor instances, number of nodes), I'm inclined to
>> suspect that the issue is related to the reduction of partitions from thousands to
>> 96; I could be misguided but given the details I have I would consider
>> testing an approach to understand the behavior if the final stage operates
>> at different number of partitions.
>>
>> On Mon, Dec 3, 2018 at 2:48 AM Conrad Lee  wrote:
>>
>>> Thanks for the thoughts.  While the beginning of the job deals with lots
>>> of files in the first stage, they're first coalesced down into just a few
>>> thousand partitions.  The part of the job that's failing is the reduce-side
>>> of a dataframe.sort() that writes output to HDFS.  This last stage has only
>>> 96 tasks and the partitions are well balanced.  I'm not using a
>>> `partitionBy` option on the dataframe writer.
>>>
>>> On Fri, Nov 30, 2018 at 8:14 PM Christopher Petrino <
>>> christopher.petr...@gmail.com> wrote:
>>>
>>>> The reason I ask is because I've had some unreliability caused by over
>>>> stressing the HDFS. Do you know the number of partitions when these actions
>>>> are being run? I.e. if you have 1,000,000 files being read you may have
>>>> 1,000,000 partitions, which may cause HDFS stress. Alternatively, if you have
>>>> 1 large file, say 100 GB, you may have 1 partition, which would not fit in memory
>>>> and may cause writes to disk. I imagine it may be flaky because you are
>>>> doing some action like a groupBy somewhere and depending on how the data
>>>> was read certain groups will be in certain partitions; I'm not sure if
>>>> reads on files are deterministic, I suspect they are not
>>>>
>>>> On Fri, Nov 30, 2018 at 2:08 PM Conrad Lee  wrote:
>>>>
>>>>> I'm loading the data using the dataframe reader from parquet files
>>>>> stored on local HDFS.  The stage of the job that fails is not the stage
>>>>> that does this.  The stage of the job that fails is one that reads a 
>>>>> sorted
>>>>> dataframe from the last shuffle and performs the final write to parquet on
>>>>> local HDFS.
>>>>>
>>>>> On Fri, Nov 30, 2018 at 4:02 PM Christopher Petrino <
>>>>> christopher.petr...@gmail.com> wrote:
>>>>>
>>>>>> How are you loading the data?
>>>>>>
>>>>>> On Fri, Nov 30, 2018 at 2:26 AM Conrad Lee 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the suggestions.  Here's an update that responds to some
>>>>>>> of the suggestions/ideas in-line:
>>>>>>>
>>>>>>> I ran into problems using 5.19 so I referred to 5.17 and it resolved
>>>>>>>> my issues.
>>>>>>>
>>>>>>>
>>>>>>> I tried EMR 5.17.0 and the problem still sometimes occurs.
>>>>>>>
>>>>>>>  try running a coalesce. Your data may have grown and is defaulting
>>>>>>>> to a number of partitions that causing unnecessary overhead
>>>>>>>>
>>>>>>> Well I don't think it's that because this problem occurs flakily.
>>>>>>> That is, if the job hangs I can kill it and re-run it and it works fine 
>>>>>

Re: Job hangs in blocked task in final parquet write stage

2018-12-04 Thread Conrad Lee
Yeah, probably increasing the memory or increasing the number of output
partitions would help.  However increasing memory available to each
executor would add expense.  I want to keep the number of partitions low so
that each parquet file turns out to be around 128 mb, which is best
practice for long-term storage and use with other systems like presto.

This feels like a bug due to the flakey nature of the failure -- also,
usually when the memory gets too low the executor is killed or errors out
and I get one of the typical Spark OOM error codes.  When I run the same
job with the same resources sometimes this job succeeds, and sometimes it
fails.

On Mon, Dec 3, 2018 at 5:19 PM Christopher Petrino <
christopher.petr...@gmail.com> wrote:

> Depending on the size of your data set and how many resources you have
> (num-executors, executor instances, number of nodes), I'm inclined to
> suspect that the issue is related to the reduction of partitions from thousands to
> 96; I could be misguided but given the details I have I would consider
> testing an approach to understand the behavior if the final stage operates
> at different number of partitions.
>
> On Mon, Dec 3, 2018 at 2:48 AM Conrad Lee  wrote:
>
>> Thanks for the thoughts.  While the beginning of the job deals with lots
>> of files in the first stage, they're first coalesced down into just a few
>> thousand partitions.  The part of the job that's failing is the reduce-side
>> of a dataframe.sort() that writes output to HDFS.  This last stage has only
>> 96 tasks and the partitions are well balanced.  I'm not using a
>> `partitionBy` option on the dataframe writer.
>>
>> On Fri, Nov 30, 2018 at 8:14 PM Christopher Petrino <
>> christopher.petr...@gmail.com> wrote:
>>
>>> The reason I ask is because I've had some unreliability caused by over
>>> stressing the HDFS. Do you know the number of partitions when these actions
>>> are being run? I.e. if you have 1,000,000 files being read you may have
>>> 1,000,000 partitions, which may cause HDFS stress. Alternatively, if you have
>>> 1 large file, say 100 GB, you may have 1 partition, which would not fit in memory
>>> and may cause writes to disk. I imagine it may be flaky because you are
>>> doing some action like a groupBy somewhere and depending on how the data
>>> was read certain groups will be in certain partitions; I'm not sure if
>>> reads on files are deterministic, I suspect they are not
>>>
>>> On Fri, Nov 30, 2018 at 2:08 PM Conrad Lee  wrote:
>>>
>>>> I'm loading the data using the dataframe reader from parquet files
>>>> stored on local HDFS.  The stage of the job that fails is not the stage
>>>> that does this.  The stage of the job that fails is one that reads a sorted
>>>> dataframe from the last shuffle and performs the final write to parquet on
>>>> local HDFS.
>>>>
>>>> On Fri, Nov 30, 2018 at 4:02 PM Christopher Petrino <
>>>> christopher.petr...@gmail.com> wrote:
>>>>
>>>>> How are you loading the data?
>>>>>
>>>>> On Fri, Nov 30, 2018 at 2:26 AM Conrad Lee  wrote:
>>>>>
>>>>>> Thanks for the suggestions.  Here's an update that responds to some
>>>>>> of the suggestions/ideas in-line:
>>>>>>
>>>>>> I ran into problems using 5.19 so I referred to 5.17 and it resolved
>>>>>>> my issues.
>>>>>>
>>>>>>
>>>>>> I tried EMR 5.17.0 and the problem still sometimes occurs.
>>>>>>
>>>>>>  try running a coalesce. Your data may have grown and is defaulting
>>>>>>> to a number of partitions that causing unnecessary overhead
>>>>>>>
>>>>>> Well I don't think it's that because this problem occurs flakily.
>>>>>> That is, if the job hangs I can kill it and re-run it and it works fine 
>>>>>> (on
>>>>>> the same hardware and with the same memory settings).  I'm not getting 
>>>>>> any
>>>>>> OOM errors.
>>>>>>
>>>>>> On a related note: the job is spilling to disk. I see messages like
>>>>>> this:
>>>>>>
>>>>>> 18/11/29 21:40:06 INFO UnsafeExternalSorter: Thread 156 spilling sort
>>>>>>> data of 912.0 MB to disk (3  times so far)
>>>>>>
>>>>>>
>>>>>>  This occurs in both successful and unsuccessful runs though.  I've
>>>>>> checked the disk

Re: Job hangs in blocked task in final parquet write stage

2018-11-29 Thread Conrad Lee
Thanks, I'll try using 5.17.0.

For anyone trying to debug this problem in the future: In other jobs that
hang in the same manner, the thread dump didn't have any blocked threads,
so that might be a red herring.

On Wed, Nov 28, 2018 at 4:34 PM Christopher Petrino <
christopher.petr...@gmail.com> wrote:

> I ran into problems using 5.19 so I referred to 5.17 and it resolved my
> issues.
>
> On Wed, Nov 28, 2018 at 2:48 AM Conrad Lee  wrote:
>
>> Hello Vadim,
>>
>> Interesting.  I've only been running this job at scale for a couple weeks
>> so I can't say whether this is related to recent EMR changes.
>>
>> Much of the EMR-specific code for spark has to do with writing files to
>> s3.  In this case I'm writing files to the cluster's HDFS though so my
>> sense is that this is a spark issue, not an EMR (but I'm not sure).
>>
>> Conrad
>>
>> On Tue, Nov 27, 2018 at 5:21 PM Vadim Semenov 
>> wrote:
>>
>>> Hey Conrad,
>>>
>>> has it started happening recently?
>>>
>>> We recently started having some sporadic problems with drivers on EMR
>>> when it gets stuck, up until two weeks ago everything was fine.
>>> We're trying to figure out with the EMR team where the issue is coming
>>> from.
>>> On Tue, Nov 27, 2018 at 6:29 AM Conrad Lee  wrote:
>>> >
>>> > Dear spark community,
>>> >
>>> > I'm running spark 2.3.2 on EMR 5.19.0.  I've got a job that's hanging
>>> in the final stage--the job usually works, but I see this hanging behavior
>>> in about one out of 50 runs.
>>> >
>>> > The second-to-last stage sorts the dataframe, and the final stage
>>> writes the dataframe to HDFS.
>>> >
>>> > Here you can see the executor logs, which indicate that it has
>>> finished processing the task.
>>> >
>>> > Here you can see the thread dump from the executor that's hanging.
>>> Here's the text of the blocked thread.
>>> >
>>> > I tried to work around this problem by enabling speculation, but
>>> speculative execution never takes place.  I don't know why.
>>> >
>>> > Can anyone here help me?
>>> >
>>> > Thanks,
>>> > Conrad
>>>
>>>
>>>
>>> --
>>> Sent from my iPhone
>>>
>>


Re: Job hangs in blocked task in final parquet write stage

2018-11-27 Thread Conrad Lee
Hello Vadim,

Interesting.  I've only been running this job at scale for a couple weeks
so I can't say whether this is related to recent EMR changes.

Much of the EMR-specific code for spark has to do with writing files to
s3.  In this case I'm writing files to the cluster's HDFS though so my
sense is that this is a spark issue, not an EMR (but I'm not sure).

Conrad

On Tue, Nov 27, 2018 at 5:21 PM Vadim Semenov  wrote:

> Hey Conrad,
>
> has it started happening recently?
>
> We recently started having some sporadic problems with drivers on EMR
> when it gets stuck, up until two weeks ago everything was fine.
> We're trying to figure out with the EMR team where the issue is coming
> from.
> On Tue, Nov 27, 2018 at 6:29 AM Conrad Lee  wrote:
> >
> > Dear spark community,
> >
> > I'm running spark 2.3.2 on EMR 5.19.0.  I've got a job that's hanging in
> the final stage--the job usually works, but I see this hanging behavior in
> about one out of 50 runs.
> >
> > The second-to-last stage sorts the dataframe, and the final stage writes
> the dataframe to HDFS.
> >
> > Here you can see the executor logs, which indicate that it has finished
> processing the task.
> >
> > Here you can see the thread dump from the executor that's hanging.
> Here's the text of the blocked thread.
> >
> > I tried to work around this problem by enabling speculation, but
> speculative execution never takes place.  I don't know why.
> >
> > Can anyone here help me?
> >
> > Thanks,
> > Conrad
>
>
>
> --
> Sent from my iPhone
>


Re: Job hangs in blocked task in final parquet write stage

2018-11-27 Thread Conrad Lee
Dear spark community,

I'm running spark 2.3.2 on EMR 5.19.0.  I've got a job that's hanging in
the final stage--the job usually works, but I see this hanging behavior in
about one out of 50 runs.

The second-to-last stage sorts the dataframe, and the final stage writes
the dataframe to HDFS.

Here  you can see the executor logs, which
indicate that it has finished processing the task.

Here  you can see the thread dump from the
executor that's hanging.  Here's the text of
the blocked thread.

I tried to work around this problem by enabling speculation, but
speculative execution never takes place.  I don't know why.

Can anyone here help me?

Thanks,
Conrad


Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
Note the --packages option works for both PySpark and Spark (Scala).  For
the SparkLauncher class, you should be able to include packages ala:

spark.addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")
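
A minimal sketch of how this might look end to end with SparkLauncher,
assuming a Scala driver; the jar path, main class, and names below are
illustrative placeholders, and the --packages coordinate is the one quoted above:

import org.apache.spark.launcher.SparkLauncher

// Launch a Spark application programmatically and pass --packages through,
// just as spark-submit would.
val handle = new SparkLauncher()
  .setAppResource("/path/to/graph-app.jar")   // illustrative path
  .setMainClass("com.example.GraphApp")       // illustrative class
  .setMaster("yarn")
  .addSparkArg("--packages", "graphframes:0.5.0-spark2.0-s_2.11")
  .startApplication()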


On Sun, Feb 18, 2018 at 3:30 PM xiaobo <guxiaobo1...@qq.com> wrote:

> Hi Denny,
> The pyspark script uses the --packages option to load graphframe library,
> what about the SparkLauncher class?
>
>
>
> -- Original --
> *From:* Denny Lee <denny.g@gmail.com>
> *Date:* Sun,Feb 18,2018 11:07 AM
> *To:* 94035420 <guxiaobo1...@qq.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: Does Pyspark Support Graphx?
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420 <guxiaobo1...@qq.com> wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
Most likely not as most of the effort is currently on GraphFrames  - a
great blog post on the what GraphFrames offers can be found at:
https://databricks.com/blog/2016/03/03/introducing-graphframes.html.   Is
there a particular scenario or situation that you're addressing that
requires GraphX vs. GraphFrames?

On Sat, Feb 17, 2018 at 8:26 PM xiaobo <guxiaobo1...@qq.com> wrote:

> Thanks Denny, will it be supported in the near future?
>
>
>
> -- Original ------
> *From:* Denny Lee <denny.g@gmail.com>
> *Date:* Sun,Feb 18,2018 11:05 AM
> *To:* 94035420 <guxiaobo1...@qq.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: Does Pyspark Support Graphx?
>
> That’s correct - you can use GraphFrames though as it does support
> PySpark.
> On Sat, Feb 17, 2018 at 17:36 94035420 <guxiaobo1...@qq.com> wrote:
>
>> I can not find anything for graphx module in the python API document,
>> does it mean it is not supported yet?
>>
>


Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though as it does support PySpark.
On Sat, Feb 17, 2018 at 17:36 94035420  wrote:

> I can not find anything for graphx module in the python API document, does
> it mean it is not supported yet?
>


Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi
​


I have a few questions about the structure of HDFS and S3 when Spark
loads data from these two storage systems.


Generally, when Spark loads data from HDFS, HDFS supports data locality and
already owns the distributed files on the datanodes, right? Spark can just
process the data on the workers.


What about S3? Many people in this field use S3 for storing or loading data
remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how will
all the data be spread across the workers? Is the master node responsible for
this task? Does it read all the data from S3 and then spread it to the
workers? So it might be a trade-off compared to HDFS, or have I got this
wrong?


In what ways is S3 better than HDFS?


Thanks in advance


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :)

On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
wrote:

> That's great!
>
>
>
> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>
>> Awesome! Congrats!!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>> *To:* user@spark.apache.org
>> *Subject:* With 2.2.0 PySpark is now available for pip install from PyPI
>> :)
>>
>> Hi wonderful Python + Spark folks,
>>
>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>> been a long time coming (previous releases included pip installable
>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>> if you (or your friends) want to be able to work with PySpark locally on
>> your laptop you've got an easier path getting started (pip install pyspark).
>>
>> If you are setting up a standalone cluster your cluster will still need
>> the "full" Spark packaging, but the pip installed PySpark should be able to
>> work with YARN or an existing standalone cluster installation (of the same
>> version).
>>
>> Happy Sparking y'all!
>>
>> Holden :)
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Spark Shell issue on HDInsight

2017-05-14 Thread Denny Lee
Sorry for the delay, you just did as I'm with the Azure CosmosDB (formerly
DocumentDB) team.  If you'd like to make it official, why not add an issue
to the GitHub repo at https://github.com/Azure/azure-documentdb-spark/issues.
HTH!

On Thu, May 11, 2017 at 9:08 PM ayan guha <guha.a...@gmail.com> wrote:

> Works for me too... you are a life-saver :)
>
> But the question: should/how we report this to Azure team?
>
> On Fri, May 12, 2017 at 10:32 AM, Denny Lee <denny.g@gmail.com> wrote:
>
>> I was able to repro your issue when I had downloaded the jars via blob
>> but when I downloaded them as raw, I was able to get everything up and
>> running.  For example:
>>
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget https://github.com/Azure/azure-documentdb-spark/*blob*
>> /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> resulted in the error:
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> [init] error: error while loading , Error accessing
>> /home/sshuser/jars/test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>
>> Failed to initialize compiler: object java.lang.Object in compiler mirror
>> not found.
>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>> ** object programmatically, settings.usejavacp.value = true.
>>
>> But when running:
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-1.10.0.jar
>> wget
>> https://github.com/Azure/azure-documentdb-spark/raw/master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>> it was up and running:
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel).
>> 17/05/11 22:54:06 WARN SparkContext: Use an existing SparkContext, some
>> configuration may not take effect.
>> Spark context Web UI available at http://10.0.0.22:4040
>> Spark context available as 'sc' (master = yarn, app id =
>> application_1494248502247_0013).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2.2.5.4.0-121
>>   /_/
>>
>> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala>
>>
>> HTH!
>>
>>
>> On Wed, May 10, 2017 at 11:49 PM ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Thanks for reply, but unfortunately did not work. I am getting same
>>> error.
>>>
>>> sshuser@ed0-svochd:~/azure-spark-docdb-test$ spark-shell --jars
>>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/azure-documentdb-spark-0.0.3-SNAPSHOT.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatica

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
ala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sshuser@ed0-svochd:~/azure-spark-docdb-test$
>
>
> On Mon, May 8, 2017 at 11:50 PM, Denny Lee <denny.g@gmail.com> wrote:
>
>> This appears to be an issue with the Spark to DocumentDB connector,
>> specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
>> see if you're still getting the same error?  i.e.
>>
>> spark-shell --master yarn --jars
>> azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>>
>>
>> On Mon, May 8, 2017 at 5:01 AM ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am facing an issue while trying to use azure-document-db connector
>>> from Microsoft. Instructions/Github
>>> <https://github.com/Azure/azure-documentdb-spark/wiki/Azure-DocumentDB-Spark-Connector-User-Guide>
>>> .
>>>
>>> Error while trying to add jar in spark-shell:
>>>
>>> spark-shell --jars
>>> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
>>> SPARK_MAJOR_VERSION is set to 2, using Spark2
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel).
>>> [init] error: error while loading , Error accessing
>>> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>>
>>> Failed to initialize compiler: object java.lang.Object in compiler
>>> mirror not found.
>>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>> ** object programmatically, settings.usejavacp.value = true.
>>> Exception in thread "main" java.lang.NullPointerException
>>> at
>>> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
>>>

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector,
specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
see if you're still getting the same error?  i.e.

spark-shell --master yarn --jars
azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar


On Mon, May 8, 2017 at 5:01 AM ayan guha  wrote:

> Hi
>
> I am facing an issue while trying to use azure-document-db connector from
> Microsoft. Instructions/Github
> 
> .
>
> Error while trying to add jar in spark-shell:
>
> spark-shell --jars
> azure-documentdb-spark-0.0.1.jar,azure-documentdb-1.9.6.jar
> SPARK_MAJOR_VERSION is set to 2, using Spark2
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> [init] error: error while loading , Error accessing
> /home/sshuser/azure-spark-docdb-test/v1/azure-documentdb-spark-0.0.1.jar
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
>
> Failed to initialize compiler: object java.lang.Object in compiler mirror
> not found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programmatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.NullPointerException
> at
> scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
> at
> scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896)
> at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336)
> at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
> at
> scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336)
> at
> scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002)
> at
> scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
> at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
> at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
> at
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
> at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:68)
> at org.apache.spark.repl.Main$.main(Main.scala:51)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at
> 

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Dongjin Lee
Sumit,

I think the post below describes exactly your case.

https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/
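
For reference, a rough sketch of the kind of settings that post discusses,
assuming Spark 2.1+ where the spark.blacklist.* properties are available
(the values are illustrative, not tuned recommendations):

# spark-defaults.conf
spark.blacklist.enabled                          true
spark.blacklist.task.maxTaskAttemptsPerNode      2
spark.blacklist.stage.maxFailedTasksPerExecutor  2
spark.blacklist.stage.maxFailedExecutorsPerNode  2

With blacklisting on, repeated task failures on the same executor or node
exclude that executor/node for the stage, so retries land on other slaves.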

Regards,
Dongjin

--
Dongjin Lee

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: http://www.facebook.com/dongjin.lee.kr
linkedin: http://kr.linkedin.com/in/dongjinleekr
github: http://github.com/dongjinleekr
twitter: http://www.twitter.com/dongjinleekr

On 22 Apr 2017, 5:32 AM +0900, Chawla,Sumit <sumitkcha...@gmail.com>, wrote:
> I am seeing a strange issue. I had a badly behaving slave that failed the
> entire job. I have set spark.task.maxFailures to 8 for my job. It seems like all
> task retries happen on the same slave in case of failure. My expectation was
> that a task would be retried on a different slave in case of failure, and the
> chance of all 8 retries happening on the same slave is very low.
>
>
> Regards
> Sumit Chawla
>


Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to
DocumentDB (https://github.com/Azure/azure-documentdb-spark) if sticking
with Scala?
On Thu, Apr 20, 2017 at 21:46 Nan Zhu  wrote:

> DocDB does have a java client? Anything prevent you using that?
>
> Get Outlook for iOS 
> --
> *From:* ayan guha 
> *Sent:* Thursday, April 20, 2017 9:24:03 PM
> *To:* Ashish Singh
> *Cc:* user
> *Subject:* Re: Azure Event Hub with Pyspark
>
> Hi
>
> yes, its only scala. I am looking for a pyspark version, as i want to
> write to documentDB which has good python integration.
>
> Thanks in advance
>
> best
> Ayan
>
> On Fri, Apr 21, 2017 at 2:02 PM, Ashish Singh 
> wrote:
>
>> Hi ,
>>
>> You can try https://github.com/hdinsight/spark-eventhubs : which is
>> eventhub receiver for spark streaming
>> We are using it but you have scala version only i guess
>>
>>
>> Thanks,
>> Ashish Singh
>>
>> On Fri, Apr 21, 2017 at 9:19 AM, ayan guha  wrote:
>>
>>> 
>>>
>>> Hi
>>>
>>> I am not able to find any connector to be used to connect Spark Streaming
>>> with Azure Event Hub using PySpark.
>>>
>>> Does anyone know if such a library/package exists?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause
for Spark 2.x?

Thanks!


Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-21 Thread Dongjin Lee
Hi Chetan,

Sadly, you cannot; Spark is configured to ignore null values when
writing JSON. (Check JacksonMessageWriter and find
JsonInclude.Include.NON_NULL in the code.) If you want that
functionality, it would be much better to file the problem in JIRA.
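
A small spark-shell illustration of that behavior (the case class and values
are made up for the example):

import spark.implicits._

case class Person(first_name: String, last_name: String)
val df = Seq(Person("Dongjin", null)).toDF()

// The null field is dropped from the generated JSON instead of being
// written as "last_name": null.
df.toJSON.collect().foreach(println)
// prints: {"first_name":"Dongjin"}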

Best,
Dongjin

On Mon, Mar 20, 2017 at 4:44 PM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> Exactly.
>
> On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee <dong...@apache.org> wrote:
>
>> Hello Chetan,
>>
>> Could you post some code? If I understood correctly, you are trying to
>> save JSON like:
>>
>> {
>>   "first_name": "Dongjin",
>>   "last_name: null
>> }
>>
>> not in omitted form, like:
>>
>> {
>>   "first_name": "Dongjin"
>> }
>>
>> right?
>>
>> - Dongjin
>>
>> On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Dev / Users,
>>>
>>> I am working on migrating PySpark code to Scala. With Python, iterating
>>> Spark with a dictionary and generating JSON with null is possible with
>>> json.dumps(), which will be converted to SparkSQL[Row], but in Scala how
>>> can we generate JSON with null values as a DataFrame?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>> *Dongjin Lee*
>>
>>
>> *Software developer in Line+.So interested in massive-scale machine
>> learning.facebook: www.facebook.com/dongjin.lee.kr
>> <http://www.facebook.com/dongjin.lee.kr>linkedin: 
>> kr.linkedin.com/in/dongjinleekr
>> <http://kr.linkedin.com/in/dongjinleekr>github:
>> <http://goog_969573159/>github.com/dongjinleekr
>> <http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr
>> <http://www.twitter.com/dongjinleekr>*
>>
>
>


-- 
*Dongjin Lee*


*Software developer in Line+.So interested in massive-scale machine
learning.facebook: www.facebook.com/dongjin.lee.kr
<http://www.facebook.com/dongjin.lee.kr>linkedin:
kr.linkedin.com/in/dongjinleekr
<http://kr.linkedin.com/in/dongjinleekr>github:
<http://goog_969573159/>github.com/dongjinleekr
<http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr
<http://www.twitter.com/dongjinleekr>*


Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-11 Thread Dongjin Lee
Hello Chetan,

Could you post some code? If I understood correctly, you are trying to save
JSON like:

{
  "first_name": "Dongjin",
  "last_name: null
}

not in omitted form, like:

{
  "first_name": "Dongjin"
}

right?

- Dongjin

On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> Hello Dev / Users,
>
> I am working on migrating PySpark code to Scala. With Python, iterating
> Spark with a dictionary and generating JSON with null is possible with
> json.dumps(), which will be converted to SparkSQL[Row], but in Scala how can
> we generate JSON with null values as a DataFrame?
>
> Thanks.
>



-- 
*Dongjin Lee*


*Software developer in Line+.So interested in massive-scale machine
learning.facebook: www.facebook.com/dongjin.lee.kr
<http://www.facebook.com/dongjin.lee.kr>linkedin:
kr.linkedin.com/in/dongjinleekr
<http://kr.linkedin.com/in/dongjinleekr>github:
<http://goog_969573159/>github.com/dongjinleekr
<http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr
<http://www.twitter.com/dongjinleekr>*


Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com
wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com 
cont...@chrissmurphy.com
wrote:
PLEASE!!

Pretrained Word2Vec models

2016-12-05 Thread Lee Becker
Hi all,

Is there a way for Spark to load Word2Vec models trained using gensim
<https://radimrehurek.com/gensim/> or the original C implementation
<https://code.google.com/archive/p/word2vec/> of Word2Vec?  Specifically
I'd like to play with the Google News model
<https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing>
or
the Freebase model
<https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing>
to
see how they perform before training my own.

Thanks,
Lee


Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger file sizes due to the
overhead of opening up files.  Typical guidance is between 64MB-1GB;
personally I usually stick with 128MB-512MB with the default snappy codec
compression for parquet.  A good reference is Vida Ha's presentation Data
Storage Tips for Optimal Spark Performance
.
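
A rough sketch of one way to apply that guidance with the Spark 2.x
DataFrameWriter, assuming df is the DataFrame being written (the partition
count and path are illustrative; pick numPartitions so that total output size
divided by numPartitions lands in the 128MB-512MB range):

// Fewer output partitions => fewer, larger parquet files.
val numPartitions = 16
df.coalesce(numPartitions)
  .write
  .option("compression", "snappy")   // snappy is the usual default codec for parquet
  .parquet("hdfs:///path/to/output")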


On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:

> Hi Everyone,
> Does anyone know what the best practice is for writing parquet files from
> Spark?
>
> When the Spark app writes data to parquet, it shows that under that directory
> there are heaps of very small parquet files (such as
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only
> 15KB.
>
> Should it write each chunk with a bigger data size (such as 128 MB) and a
> proper number of files?
>
> Has anyone found any performance changes when changing the data size of
> each parquet file?
>
> Thanks,
> Kevin.
>


Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good
starting point is the Apache Spark Documentation at:
http://spark.apache.org/documentation.html


The two books that immediately come to mind are

- Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do
(there's also a Chinese language version of this book)

- Advanced Analytics with Apache Spark:
http://shop.oreilly.com/product/mobile/0636920035091.do

You can also find a pretty decent listing of Apache Spark resources at:
https://sparkhub.databricks.com/resources/

HTH!


On Sun, Nov 6, 2016 at 19:00 litg <1933443...@qq.com> wrote:

> I'm a postgraduate from Shanghai Jiao Tong University, China. Recently, I
> have been carrying out a project on the realization of artificial
> algorithms on Spark in Python. However, I am not familiar with this field;
> furthermore, there are few Chinese books about Spark.
> Actually, I strongly want to study this field further. I hope someone can
> kindly recommend some books about the mechanism of Spark, or just give me
> suggestions about how to program with Spark.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hope-someone-can-recommend-some-books-for-me-a-spark-beginner-tp28033.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache
Spark at
https://www.edx.org/xseries/data-science-engineering-apacher-sparktm.

Note, a great quick start is the Getting Started with Apache Spark on
Databricks at https://databricks.com/product/getting-started-guide

HTH!

On Sun, Nov 6, 2016 at 22:20 Raghav  wrote:

> Can you please point out the right courses from EDX/Berkeley ?
>
> Many thanks.
>
> On Sun, Nov 6, 2016 at 6:08 PM, ayan guha  wrote:
>
> I would start with Spark documentation, really. Then you would probably
> start with some older videos from youtube, especially spark summit
> 2014,2015 and 2016 videos. Regading practice, I would strongly suggest
> Databricks cloud (or download prebuilt from spark site). You can also take
> courses from EDX/Berkley, which are very good starter courses.
>
> On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
>
> I am a newbie in the world of big data analytics, and I want to teach myself
> Apache Spark and be able to write scripts to tinker with data.
>
> I have some understanding of Map Reduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance for starter material, or videos.
>
> Thanks.
>
> Raghav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that you can join to that table, presuming it's
small enough to distribute.  Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html
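
A minimal spark-shell (Spark 2.x) sketch of the idea, with made-up stand-ins
for the HANA lookup table and the larger table; the broadcast() hint is what
triggers the BroadcastHashJoin:

import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// Illustrative stand-ins: a small lookup table and a larger fact table.
val lookupDF = Seq((1, "US"), (2, "DE")).toDF("country_id", "country_code")
val factDF   = Seq((100, 1), (101, 2), (102, 1)).toDF("order_id", "country_id")

// broadcast() asks Spark to ship the small table to every executor, so the
// join runs as a BroadcastHashJoin instead of a shuffled join.
val joined = factDF.join(broadcast(lookupDF), Seq("country_id"))
joined.show()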

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit  wrote:

> I have a lookup table in HANA database. I want to create a spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as an data frame
> and convert data frame into broadcast variable?
>
> Thanks,
> Nishit
>


Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use a GraphX or GraphFrames subgraph to build up your
subgraph.  A good example for GraphFrames can be found at:
http://graphframes.github.io/user-guide.html#subgraphs.  HTH!
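
A rough spark-shell sketch of the edge-filtering approach, assuming the
graphframes package is on the classpath and that the edge type lives in a
column named "relationship" (the data below mirrors the example in the question):

import spark.implicits._
import org.graphframes.GraphFrame

val vertices = Seq("1", "2", "3").toDF("id")
val edges = Seq(("1", "2", "p2p"), ("2", "3", "p2p"), ("2", "3", "c2p"))
  .toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// Keep only one edge type by filtering the edge DataFrame and wrapping the
// result in a new GraphFrame; this is the subgraph to traverse further.
val p2pGraph = GraphFrame(g.vertices, g.edges.filter("relationship = 'p2p'"))
p2pGraph.edges.show()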

On Mon, Oct 10, 2016 at 9:32 PM cashinpj  wrote:

> Hello,
>
> I have a set of data representing various network connections.  Each vertex
> is represented by a single id, while the edges have  a source id,
> destination id, and a relationship (peer to peer, customer to provider, or
> provider to customer).  I am trying to create a subgraph built around a
> single source node, following one type of edge as far as possible.
>
> For example:
> 1 2 p2p
> 2 3 p2p
> 2 3 c2p
>
> Following the p2p edges would give:
>
> 1 2 p2p
> 2 3 p2p
>
> I am pretty new to GraphX and GraphFrames, but was wondering if it is
> possible to get this behavior using the GraphFrames bfs() function or would
> it be better to modify the already existing Pregel implementation of bfs?
>
> Thank you for your time.
>
> Padraic
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-BFS-tp27876.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Lee Becker
On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose 
wrote:

> I wish to organize these dataframe operations by grouping them into Scala
> object methods.
> Something like below
>
>
>
>> *Object Driver {*
>> *def main(args: Array[String]) {*
>> *  val df = Operations.process(sparkContext)*
>> *  }**}*
>>
>>
>> *Object Operations {*
>> *  def process(sparkContext: SparkContext) : DataFrame = {*
>> *//series of dataframe operations *
>> *  }**}*
>
>
> My question is whether returning a DF from another Scala object's method
> is the right thing to do at large scale.
> Would returning the DF to the driver cause all the data to be passed to the
> driver code, or would it return just a pointer to the DF?
>

As long as the methods do not trigger any executions, it is fine to pass a
DataFrame back to the driver.  Think of a DataFrame as an abstraction over
RDDs.  When you return an RDD or DataFrame you're not returning the object
itself.  Instead you're returning a recipe that details the series of
operations needed to produce the data.
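
A small sketch of that point, using the Spark 2.x SparkSession API for
brevity (the object and column names are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object Operations {
  // Nothing executes here: the returned DataFrame is just a logical plan.
  def process(spark: SparkSession): DataFrame =
    spark.range(1000).toDF("n").filter(col("n") % 2 === 0)
}

object Driver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example").getOrCreate()
    val df = Operations.process(spark)  // still only a plan; no data moves to the driver
    println(df.count())                 // execution happens here
  }
}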


collect_set without nulls (1.6 vs 2.0)

2016-09-07 Thread Lee Becker
Hello everyone,

Consider this toy example:

case class Foo(x: String, y: String)
val df = sparkSession.createDataFrame(Array(Foo(null, "1"), Foo("a", "2"), Foo("b", "3")))
df.select(collect_set($"x")).show

In Spark 2.0.0 I get the following results:

+--------------+
|collect_set(x)|
+--------------+
|  [null, b, a]|
+--------------+

In 1.6.* the same collect_set produces:

+--------------+
|collect_set(x)|
+--------------+
|        [b, a]|
+--------------+

Is there any way to get this aggregation to ignore nulls?  I understand the
trivial way would be to filter on x beforehand, but in my actual use case
I'm calling the collect_set in withColumn over a window specification, so I
want empty arrays on rows with nulls.

For now I'm using this hack of a workaround:

val removenulls = udf((l: scala.collection.mutable.WrappedArray[String]) =>
l.filter(x=>x != null))
f.select(removenulls(collect_set($"x"))).show


Any suggestions are appreciated.

Thanks,
Lee


dynamic allocation in Spark 2.0

2016-08-24 Thread Shane Lee
Hello all,
I am running hadoop 2.6.4 with Spark 2.0 and I have been trying to get dynamic
allocation to work without success. I was able to get it to work with Spark
1.6.1 however.
When I issue the command spark-shell --master yarn --deploy-mode client,
this is the error I see:
16/08/24 00:05:40 WARN NettyRpcEndpointRef: Error sending message [message =
RequestExecutors(1,0,Map())] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
seconds]. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
        at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:128)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:493)
        at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1482)
        at org.apache.spark.ExecutorAllocationManager.start(ExecutorAllocationManager.scala:235)
        at org.apache.spark.SparkContext$$anonfun$21.apply(SparkContext.scala:534)
        at org.apache.spark.SparkContext$$anonfun$21.apply(SparkContext.scala:534)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.SparkContext.(SparkContext.scala:534)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
        at $line3.$read$$iw$$iw.(:15)
        at $line3.$read$$iw.(:31)
        at $line3.$read.(:33)
        at $line3.$read$.(:37)
        at $line3.$read$.()
        at $line3.$eval$.$print$lzycompute(:7)
        at $line3.$eval$.$print(:6)
        at $line3.$eval.$print()
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
        at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
        at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
        at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
        at
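
For reference, a rough sketch of the settings dynamic allocation on YARN
normally needs (values are illustrative; the YARN NodeManagers must also run
Spark's external shuffle service as an auxiliary service):

# spark-defaults.conf
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10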

Re: countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.bec...@hapara.com> wrote:

> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c",
> "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
>

This workaround executes with no exceptions:
val grouped = df.groupBy($"x").agg(size(collect_set($"y")),
collect_set($"y"))

In this example countDistinct and collect_set run on the same column, so the
result of countDistinct is essentially redundant. Assuming they ran on different
columns (say there were also a column 'z'), is there any computational difference
between countDistinct and size(collect_set(...))?
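For what it's worth, a minimal sketch of applying the same workaround when the
distinct count is on a different, hypothetical column 'z' (written for spark-shell,
where the SQL implicits are already in scope; the data is purely illustrative):

import org.apache.spark.sql.functions.{collect_set, size}

// Replace countDistinct with size(collect_set(...)) on z as well, so the whole
// aggregation avoids the distinct/partial-aggregation conflict in a single pass.
val df2 = sc.parallelize(Array(("a", "a", 1), ("b", "c", 2), ("c", "a", 3))).toDF("x", "y", "z")

df2.groupBy($"x").agg(
  size(collect_set($"z")),   // stands in for countDistinct($"z")
  collect_set($"y")
).show()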

-- 
*hapara* ● Making Learning Visible
1877 Broadway Street, Boulder, CO 80302
(Google Voice): +1 720 335 5332
www.hapara.com   Twitter: @hapara_team <http://twitter.com/hapara_team>


countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
Hi everyone,

I've started experimenting with my codebase to see how much work I will
need to port it from 1.6.1 to 2.0.0.  In regressing some of my dataframe
transforms, I've discovered I can no longer pair a countDistinct with a
collect_set in the same aggregation.

Consider:

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c",
"a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))

When it comes time to execute (via collect or show), I get the following
error:

*java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
> operator containing aggregate functions which don't support partial
> aggregation.*


I never encountered this behavior in previous Spark versions.  Are there
workarounds that don't require computing each aggregation separately and
joining later?  Is there a partial aggregation version of collect_set?
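For reference, the "compute each aggregation separately and join later" fallback
mentioned above would look roughly like this (aliases are illustrative; spark-shell
implicits assumed):

import org.apache.spark.sql.functions.{countDistinct, collect_set}

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")

// Each aggregation runs fine on its own; the results are then joined on the key.
val distinctCounts = df.groupBy($"x").agg(countDistinct($"y").as("y_distinct"))
val sets = df.groupBy($"x").agg(collect_set($"y").as("y_set"))
distinctCounts.join(sets, "x").show()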

Thanks,
Lee


Re: SparkR error when repartition is called

2016-08-09 Thread Shane Lee
Sun,
I am using spark in yarn client mode in a 2-node cluster with hadoop-2.7.2. My 
R version is 3.3.1.
I have the following in my spark-defaults.conf:

spark.executor.extraJavaOptions = -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError
spark.r.command = c:/R/R-3.3.1/bin/x64/Rscript
spark.ui.killEnabled = true
spark.executor.instances = 3
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.file.buffer = 1m
spark.driver.maxResultSize = 0
spark.executor.memory = 8g
spark.executor.cores = 6

I also ran into some other R errors that I was able to bypass by modifying the
worker.R file (attached). In a nutshell, I was getting the "argument is of length zero"
error sporadically, so I put in extra checks for it.
Thanks,
Shane
On Monday, August 8, 2016 11:53 PM, Sun Rui <sunrise_...@163.com> wrote:
 

 I can’t reproduce your issue with len=1 in local mode.Could you give more 
environment information?

On Aug 9, 2016, at 11:35, Shane Lee <shane_y_...@yahoo.com.INVALID> wrote:
Hi All,
I am trying out SparkR 2.0 and have run into an issue with repartition. 
Here is the R code (essentially a port of the pi-calculating scala example in 
the spark package) that can reproduce the behavior:
schema <- structType(structField("input", "integer"), structField("output", 
"integer"))
library(magrittr)

len = 3000

data.frame(n = 1:len) %>%
  as.DataFrame %>%
  SparkR:::repartition(10L) %>%
  dapply(., function (df) {
    library(plyr)
    ddply(df, .(n), function (y) {
      data.frame(z = {
        x1 = runif(1) * 2 - 1
        y1 = runif(1) * 2 - 1
        z = x1 * x1 + y1 * y1
        if (z < 1) { 1L } else { 0L }
      })
    })
  }, schema) %>%
  SparkR:::summarize(total = sum(.$output)) %>%
  collect * 4 / len
For me it runs fine as long as len is less than 5000, otherwise it errors out 
with the following message:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 56.0
  failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 (TID 899, LARBGDV-VM02):
  org.apache.spark.SparkException: R computation failed with
  Error in readBin(con, raw(), stringLen, endian = "big") : invalid 'n' argument
  Calls:  -> readBin
  Execution halted
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
If the repartition call is removed, it runs fine again, even with very large 
len.
After looking through the documentation and searching the web, I can't seem to
find any clues on how to fix this. Has anybody seen a similar problem?
Thanks in advance for your help.
Shane




  

worker.R
Description: Binary data


SparkR error when repartition is called

2016-08-08 Thread Shane Lee
Hi All,
I am trying out SparkR 2.0 and have run into an issue with repartition. 
Here is the R code (essentially a port of the pi-calculating scala example in 
the spark package) that can reproduce the behavior:
schema <- structType(structField("input", "integer"), structField("output", 
"integer"))
library(magrittr)

len = 3000

data.frame(n = 1:len) %>%
  as.DataFrame %>%
  SparkR:::repartition(10L) %>%
  dapply(., function (df) {
    library(plyr)
    ddply(df, .(n), function (y) {
      data.frame(z = {
        x1 = runif(1) * 2 - 1
        y1 = runif(1) * 2 - 1
        z = x1 * x1 + y1 * y1
        if (z < 1) { 1L } else { 0L }
      })
    })
  }, schema) %>%
  SparkR:::summarize(total = sum(.$output)) %>%
  collect * 4 / len
For me it runs fine as long as len is less than 5000, otherwise it errors out 
with the following message:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 56.0
  failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 (TID 899, LARBGDV-VM02):
  org.apache.spark.SparkException: R computation failed with
  Error in readBin(con, raw(), stringLen, endian = "big") : invalid 'n' argument
  Calls:  -> readBin
  Execution halted
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
If the repartition call is removed, it runs fine again, even with very large 
len.
After looking through the documentation and searching the web, I can't seem to
find any clues on how to fix this. Has anybody seen a similar problem?
Thanks in advance for your help.
Shane


Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya,

Here's a blog post concerning On-Time Flight Performance with GraphFrames:
https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html

It also includes a Databricks notebook that has the code in it.

HTH!
Denny
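
For anyone looking for a starting point, a minimal sketch of building a GraphFrame
(treat it as an untested sketch: the graphframes package has to be on the classpath,
e.g. added via spark-shell --packages with the coordinates matching your Spark and
Scala version, and the data below is purely illustrative):

import org.graphframes.GraphFrame

val vertices = sqlContext.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")                    // vertex DataFrame needs an "id" column

val edges = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows")
)).toDF("src", "dst", "relationship")    // edge DataFrame needs "src" and "dst" columns

val g = GraphFrame(vertices, edges)
g.inDegrees.show()                       // a simple degree query as a smoke test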


On Tue, Aug 2, 2016 at 1:16 AM Kazuaki Ishizaki  wrote:

> Sorry
> Please ignore this mail, and sorry for my misinterpretation of GraphFrames in
> Spark. I thought it referred to the flame graph profiling tool.
>
> Kazuaki Ishizaki,
>
>
>
> From:Kazuaki Ishizaki/Japan/IBM@IBMJP
> To:Divya Gehlot 
> Cc:"user @spark" 
> Date:2016/08/02 17:06
> Subject:Re: Spark GraphFrames
> --
>
>
>
> Hi,
> Kay wrote a procedure to use GraphFrames with Spark.
> *https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4*
> 
>
> Kazuaki Ishizaki
>
>
>
> From:Divya Gehlot 
> To:"user @spark" 
> Date:2016/08/02 14:52
> Subject:Spark GraphFrames
> --
>
>
>
> Hi,
>
> Has anybody worked with GraphFrames?
> Please let me know, as I need to know the real-world scenarios where it can be
> used.
>
>
> Thanks,
> Divya
>
>
>


Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0 (the Spark 2.0.0 preview),

I found it interesting that no matter how you configure

spark.sql.warehouse.dir

it always falls back to the default path, which is /user/hive/warehouse.


In the code, I notice that at line 45 of

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala


object SimpleAnalyzer extends Analyzer(

new SessionCatalog(

  new InMemoryCatalog,

  EmptyFunctionRegistry,

  new SimpleCatalystConf(caseSensitiveAnalysis = true)),

new SimpleCatalystConf(caseSensitiveAnalysis = true))


It will always initialize with SimpleCatalystConf, which applies the hardcoded
default value

defined at line 58 of


./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystConf.scala


case class SimpleCatalystConf(

caseSensitiveAnalysis: Boolean,

orderByOrdinal: Boolean = true,

groupByOrdinal: Boolean = true,

optimizerMaxIterations: Int = 100,

optimizerInSetConversionThreshold: Int = 10,

maxCaseBranchesForCodegen: Int = 20,

runSQLonFile: Boolean = true,

warehousePath: String = "/user/hive/warehouse")

  extends CatalystConf


I couldn't find any other way to get around this.


It looks like this was fixed (in SPARK-15387) after


https://github.com/apache/spark/commit/9c817d027713859cac483b4baaaf8b53c040ad93

[SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database
directory · apache/spark@9c817d0 (github.com). From the PR description: "After #12871
is fixed, we are forced to make `/user/hive/warehouse` when SimpleAnalyzer is used but
SimpleAnalyzer ma..."


Just want to confirm this was the root cause and the PR that fixed it. Thanks.
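
For reference, my understanding of the intent once the fix is in place is that the
warehouse path can be overridden when the session is built; a rough sketch, with an
illustrative path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-path-check")
  .config("spark.sql.warehouse.dir", "hdfs:///user/myuser/spark-warehouse")  // illustrative path
  .getOrCreate()

// Check which warehouse path actually took effect.
println(spark.conf.get("spark.sql.warehouse.dir"))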






Re: streaming example has error

2016-06-15 Thread Lee Ho Yeung
a:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.IllegalArgumentException: requirement failed: No output
operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at
org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:46)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:52)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:54)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:60)
at $iwC$$iwC$$iwC$$iwC.(:62)
at $iwC$$iwC$$iwC.(:64)
at $iwC$$iwC.(:66)
at $iwC.(:68)
at (:70)
at .(:74)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


scala> ssc.awaitTermination()
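
For what it's worth, the "No output operations registered" failure above points at
the missing output operation rather than the context itself; a minimal sketch of the
word-count example with an output operation registered (port 9999 is illustrative,
since the original value is not shown):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the SparkContext that spark-shell already provides as `sc`.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)   // illustrative port
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.print()   // an output operation must be registered before start()
ssc.start()
ssc.awaitTermination()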


On Wed, Jun 15, 2016 at 8:53 PM, David Newberger <
david.newber...@wandcorp.com> wrote:

> Have you tried to “set spark.driver.allowMultipleContexts = true”?
>
>
>
> *David Newberger*
>
>
>
> *From:* Lee Ho Yeung [mailto:jobmatt...@gmail.com]
> *Sent:* Tuesday, June 14, 2016 8:34 PM
> *To:* user@spark.apache.org
> *Subject:* streaming example has error
>
>
>
> when simulate streaming with nc -lk 
>
> got error below,
>
> then i try example,
>
> martin@ubuntu:~/Downloads$
> /home/martin/Downloads/spark-1.6.1/bin/run-example
> streaming.NetworkWordCount localhost 
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/06/14 18:33:06 INFO StreamingExamples: Setting log level to [WARN] for
> streaming example. To override add a custom log4j.properties to the
> classpath.
> 16/06/14 18:33:06 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16

can spark help to prevent memory error for itertools.combinations(initlist, 2) in python script

2016-06-15 Thread Lee Ho Yeung
I wrote a Python script which uses itertools.combinations(initlist, 2),

but it runs into a memory error when the number of elements in initlist goes over 14,000.

Is it possible to use Spark to do this work?

I have seen that yatel can do this; do Spark and yatel use the hard disk as memory?

If so, what needs to change in the Python code?
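
For what it's worth, a rough sketch of generating the same pairs in a distributed way
(shown in Scala; PySpark has the same cartesian and filter operations, and Spark can
spill intermediate data to disk, so the full set of pairs does not have to fit in one
machine's memory):

// Illustrative input; in practice this would be the 14,000+ element initlist.
val initlist = (1 to 14000).toList
val rdd = sc.parallelize(initlist)

// All unordered pairs, keeping each combination exactly once.
val pairs = rdd.cartesian(rdd).filter { case (a, b) => a < b }
println(pairs.count())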


Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich,

I found the cause of my problem now: I missed setting the delimiter, which is a tab.

But it still gets an error,

and I notice that only LibreOffice can open and read the file well; even Excel on
Windows cannot separate the columns into the right format.

scala> val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").option("delimiter",
"").load("/home/martin/result002.csv")
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
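
For what it's worth, that StringIndexOutOfBoundsException is consistent with the
empty delimiter string, since spark-csv expects a single delimiter character; a sketch
with the delimiter actually set to a tab:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")   // tab-separated input
  .load("/home/martin/result002.csv")

df.printSchema()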


On Wed, Jun 15, 2016 at 12:14 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> there may be an issue with data in your csv file. like blank header line
> etc.
>
> sounds like you have an issue there. I normally get rid of blank lines
> before putting csv file in hdfs.
>
> can you actually select from that temp table. like
>
> sql("select TransactionDate, TransactionType, Description, Value, Balance,
> AccountName, AccountNumber from tmp").take(2)
>
> replace those with your column names. they are mapped using case class
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 03:02, Lee Ho Yeung <jobmatt...@gmail.com> wrote:
>
>> filter also has error
>>
>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> scala> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.SQLContext
>>
>> scala> val sqlContext = new SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext =
>> org.apache.spark.sql.SQLContext@3114ea
>>
>> scala> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>> UseCompressedOops is set; assuming yes
>> Java HotSpot(TM) Client VM warning: You have loaded library
>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
>> disabled stack guard. The VM will try to fix the stack guard now.
>> It's highly recommended that you fix the library with 'execstack -c
>> ', or link it with '-z noexecstack'.
>> df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4a5
>> a6a7a8a9: string]
>>
>> scala> df.printSchema()
>> root
>>  |-- a0a1a2a3a4a5a6a7a8a9: string
>> (nullable = true)
>>
>>
>> scala> df.registerTempTable("sales")
>>
>> scala> df.filter($"a0".contains("found
>> deep=1")).filter($"a1".contains("found
>> deep=1")).filter($"a2".contains("found deep=1"))
>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>> columns: [a0a1a2a3a4a5a6a7a8a9];
>> at
>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>
>>
>>
>>
>>
>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung <jobmatt...@gmail.com>
>> wrote:
>>
>>> after tried following commands, can not show data
>>>
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>>>
>>> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
>>> com.databricks:spark-csv_2.11:1.4.0
>>>
>>> import org.apache.spark.sql.SQLContext
>>>
>>> val sqlContext = new SQLContext(sc)
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> val aggDF = sqlContext.sql("select * from sales where a0 like
>>> \"%deep=3%\"")
>>> df.collect.foreach(println)
>>> aggDF.collect.foreach(println)
>>>
>>>
>>>
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> sqlContext.sql("select * from sales").take(30).foreach(println)
>>>
>>
>>
>


Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich,

https://drive.google.com/file/d/0Bxs_ao6uuBDUQ2NfYnhvUl9EZXM/view?usp=sharing
https://drive.google.com/file/d/0Bxs_ao6uuBDUS1UzTWd1Q2VJdEk/view?usp=sharing

This time I ensured the headers cover all the data; only some columns which have
headers do not have data.

But it still cannot show all the data the way it appears when I open the file in LibreOffice.

/home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
com.databricks:spark-csv_2.11:1.4.0
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/home/martin/result002.csv")
df.printSchema()
df.registerTempTable("sales")
df.filter($"a3".contains("found deep=1"))





On Tue, Jun 14, 2016 at 9:14 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> there may be an issue with data in your csv file. like blank header line
> etc.
>
> sounds like you have an issue there. I normally get rid of blank lines
> before putting csv file in hdfs.
>
> can you actually select from that temp table. like
>
> sql("select TransactionDate, TransactionType, Description, Value, Balance,
> AccountName, AccountNumber from tmp").take(2)
>
> replace those with your column names. they are mapped using case class
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 03:02, Lee Ho Yeung <jobmatt...@gmail.com> wrote:
>
>> filter also has error
>>
>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> scala> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.SQLContext
>>
>> scala> val sqlContext = new SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext =
>> org.apache.spark.sql.SQLContext@3114ea
>>
>> scala> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>> UseCompressedOops is set; assuming yes
>> Java HotSpot(TM) Client VM warning: You have loaded library
>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
>> disabled stack guard. The VM will try to fix the stack guard now.
>> It's highly recommended that you fix the library with 'execstack -c
>> ', or link it with '-z noexecstack'.
>> df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4a5
>> a6a7a8a9: string]
>>
>> scala> df.printSchema()
>> root
>>  |-- a0a1a2a3a4a5a6a7a8a9: string
>> (nullable = true)
>>
>>
>> scala> df.registerTempTable("sales")
>>
>> scala> df.filter($"a0".contains("found
>> deep=1")).filter($"a1".contains("found
>> deep=1")).filter($"a2".contains("found deep=1"))
>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>> columns: [a0a1a2a3a4a5a6a7a8a9];
>> at
>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>
>>
>>
>>
>>
>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung <jobmatt...@gmail.com>
>> wrote:
>>
>>> after tried following commands, can not show data
>>>
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>>>
>>> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
>>> com.databricks:spark-csv_2.11:1.4.0
>>>
>>> import org.apache.spark.sql.SQLContext
>>>
>>> val sqlContext = new SQLContext(sc)
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> val aggDF = sqlContext.sql("select * from sales where a0 like
>>> \"%deep=3%\"")
>>> df.collect.foreach(println)
>>> aggDF.collect.foreach(println)
>>>
>>>
>>>
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> sqlContext.sql("select * from sales").take(30).foreach(println)
>>>
>>
>>
>


Re: can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
filter also produces an error

16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.SQLContext@3114ea

scala> val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/home/martin/result002.csv")
16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
UseCompressedOops is set; assuming yes
Java HotSpot(TM) Client VM warning: You have loaded library
/tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c
', or link it with '-z noexecstack'.
df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4a5
a6a7a8a9: string]

scala> df.printSchema()
root
 |-- a0a1a2a3a4a5a6a7a8a9: string
(nullable = true)


scala> df.registerTempTable("sales")

scala> df.filter($"a0".contains("found
deep=1")).filter($"a1".contains("found
deep=1")).filter($"a2".contains("found deep=1"))
org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
columns: [a0a1a2a3a4a5a6a7a8    a9];
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)




On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung <jobmatt...@gmail.com> wrote:

> after tried following commands, can not show data
>
>
> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>
> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>
> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
> com.databricks:spark-csv_2.11:1.4.0
>
> import org.apache.spark.sql.SQLContext
>
> val sqlContext = new SQLContext(sc)
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
> df.printSchema()
> df.registerTempTable("sales")
> val aggDF = sqlContext.sql("select * from sales where a0 like
> \"%deep=3%\"")
> df.collect.foreach(println)
> aggDF.collect.foreach(println)
>
>
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("/home/martin/result002.csv")
> df.printSchema()
> df.registerTempTable("sales")
> sqlContext.sql("select * from sales").take(30).foreach(println)
>


streaming example has error

2016-06-14 Thread Lee Ho Yeung
When simulating streaming with nc -lk 

I got the error below,

so then I tried the bundled example:

martin@ubuntu:~/Downloads$
/home/martin/Downloads/spark-1.6.1/bin/run-example
streaming.NetworkWordCount localhost 
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
16/06/14 18:33:06 INFO StreamingExamples: Setting log level to [WARN] for
streaming example. To override add a custom log4j.properties to the
classpath.
16/06/14 18:33:06 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/06/14 18:33:06 WARN Utils: Your hostname, ubuntu resolves to a loopback
address: 127.0.1.1; using 192.168.157.134 instead (on interface eth0)
16/06/14 18:33:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
16/06/14 18:33:13 WARN SizeEstimator: Failed to check whether
UseCompressedOops is set; assuming yes


That got an error too. The code I ran was:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new
SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", )
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
ssc.start()
ssc.awaitTermination()
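
For reference, the BindException in the transcript below appears to come from a
second SparkContext trying to start its own UI on port 4040 while the shell's context
is still running; a sketch that reuses the existing shell context instead of building
a new one from a SparkConf:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// In spark-shell a SparkContext already exists as `sc`, so build on top of it
// rather than constructing a second context.
val ssc = new StreamingContext(sc, Seconds(1))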



scala> import org.apache.spark._
import org.apache.spark._

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._

scala> import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.StreamingContext._

scala> val conf = new
SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@67bcaf

scala> val ssc = new StreamingContext(conf, Seconds(1))
16/06/14 18:28:44 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
at
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.(SparkContext.scala:481)
at
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
at
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
at
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
at $line37.$read$$iwC$$iwC$$iwC$$iwC.(:59)
at $line37.$read$$iwC$$iwC$$iwC.(:61)
at $line37.$read$$iwC$$iwC.(:63)
at $line37.$read$$iwC.(:65)
at $line37.$read.(:67)
at $line37.$read$.(:71)
at $line37.$read$.()
at $line37.$eval$.(:7)
at $line37.$eval$.()
at $line37.$eval.$print()
at 

can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
After trying the following commands, I cannot show the data:

https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing

/home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
com.databricks:spark-csv_2.11:1.4.0

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/home/martin/result002.csv")
df.printSchema()
df.registerTempTable("sales")
val aggDF = sqlContext.sql("select * from sales where a0 like \"%deep=3%\"")
df.collect.foreach(println)
aggDF.collect.foreach(println)



val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("/home/martin/result002.csv")
df.printSchema()
df.registerTempTable("sales")
sqlContext.sql("select * from sales").take(30).foreach(println)


Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-09 Thread Andrew Lee
In fact, it does require ojdbc from Oracle, which also requires a username and
password to download. This was added as part of the testing scope for Oracle's Docker
integration tests.


I noticed this PR and commit in branch-2.0 according to
https://issues.apache.org/jira/browse/SPARK-12941.

In the comment, I'm not sure what is meant by installing the JAR locally while the
Spark QA tests run. If this is the case,

it means someone downloaded the JAR from Oracle and manually added it to the local
build machine that builds Spark branch-2.0, or to an internal Maven repository that
serves this ojdbc JAR.




commit 8afe49141d9b6a603eb3907f32dce802a3d05172

Author: thomastechs 

Date:   Thu Feb 25 22:52:25 2016 -0800


[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map 
string datatypes to Oracle VARCHAR datatype



## What changes were proposed in this pull request?



This Pull request is used for the fix SPARK-12941, creating a data type 
mapping to Oracle for the corresponding data type"Stringtype" from

dataframe. This PR is for the master branch fix, where as another PR is already 
tested with the branch 1.4



## How was the this patch tested?



(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)

This patch was tested using the Oracle docker .Created a new integration 
suite for the same.The oracle.jdbc jar was to be downloaded from the maven 
repository.Since there was no jdbc jar available in the maven repository, the 
jar was downloaded from oracle site manually and installed in the local; thus 
tested. So, for SparkQA test case run, the ojdbc jar might be manually placed 
in the local maven repository(com/oracle/ojdbc6/11.2.0.2.0) while Spark QA test 
run.



Author: thomastechs 



Closes #11306 from thomastechs/master.




Meanwhile, I also notice that the ojdbc groupId provided by Oracle (official
website https://blogs.oracle.com/dev2dev/entry/how_to_get_oracle_jdbc) is
different:




<dependency>
  <groupId>com.oracle.jdbc</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.4</version>
  <scope>test</scope>
</dependency>




as opposed to the one in Spark branch-2.0

external/docker-integration-tests/pom.xml




<dependency>
  <groupId>com.oracle</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.1.0</version>
  <scope>test</scope>
</dependency>





The version is out of date and not available from the Oracle Maven repo. The PR
was created a while back, so the fix may simply have crossed paths with Oracle's
Maven release blog post.


Just my inference based on what I see from git and JIRA; however, I do see that a
fix is required to patch pom.xml to apply the correct groupId and version number for
the ojdbc6 driver.


Thoughts?



Reference: "Get Oracle JDBC drivers and UCP from Oracle Maven Repository (without
IDEs)" by Nirmala Sundarappa-Oracle, Feb 15, 2016 (blogs.oracle.com).









From: Mich Talebzadeh 
Sent: Tuesday, May 3, 2016 1:04 AM
To: Luciano Resende
Cc: Hien Luu; ☼ R Nair (रविशंकर नायर); user
Subject: Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

Which version of Spark are you using?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 May 2016 at 02:13, Luciano Resende 
> wrote:
You might have a settings.xml that is forcing your internal Maven repository to 
be the mirror of external repositories and thus not finding the dependency.

On Mon, May 2, 2016 at 6:11 PM, Hien Luu 
> wrote:
Not I am not.  I am considering downloading it manually and place it in my 
local repository.

On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) 
> wrote:

The Oracle JDBC driver is not part of the Maven repository; are you keeping a downloaded file
in your local repo?

Best, RS

On May 2, 2016 8:51 PM, "Hien Luu" 
> wrote:
Hi all,

I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.  It 
kept getting "Operation timed out" while building Spark Project Docker 
Integration Tests module (see the error below).

Has anyone run into this problem before? If so, how did you work around it?

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [  2.423 s]

[INFO] Spark Project Test Tags  SUCCESS [  0.712 s]

[INFO] Spark Project Sketch ... SUCCESS [  0.498 s]

[INFO] Spark Project Networking ... SUCCESS [  1.743 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.587 s]

[INFO] Spark Project Unsafe ... SUCCESS [  0.503 s]


Re: Dataset aggregateByKey equivalent

2016-04-25 Thread Lee Becker
On Sat, Apr 23, 2016 at 8:56 AM, Michael Armbrust 
wrote:

> Have you looked at aggregators?
>
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>

Thanks for the pointer to aggregators.  I wasn't yet aware of them.
However, I still get similar errors when attempting to go the Dataset
route.  Here is my attempt at an aggregator class for the example above.


import org.apache.spark.sql.expressions.Aggregator

case class KeyVal(k: Int, v: Int)

val keyValsDs = sc.parallelize(for (i <- 1 to 3; j <- 4 to 6) yield
KeyVal(i,j)).toDS

class AggKV extends Aggregator[KeyVal, List[KeyVal], List[KeyVal]]
with Serializable {
  override def zero: List[KeyVal] = List()

  override def reduce(b: List[KeyVal], a: KeyVal): List[KeyVal] = b :+ a

  override def finish(reduction: List[KeyVal]): List[KeyVal] = reduction

  override def merge(b1: List[KeyVal], b2: List[KeyVal]): List[KeyVal]
= b1 ++ b2
}

The following shows production of the correct schema

keyValsDs.groupBy($"k").agg(new AggKV().toColumn)

 org.apache.spark.sql.Dataset[(org.apache.spark.sql.Row, List[KeyVal])] =
[_1: struct, _2: struct>>]

Actual execution fails with Task not serializable.  Am I missing something
or is this just not possible without dropping into RDDs?

scala> keyValsDs.groupBy($"k").agg(new AggKV().toColumn).show
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
tree:
SortBasedAggregate(key=[k#684],
functions=[(AggKV(k#684,v#685),mode=Final,isDistinct=false)],
output=[key#739,AggKV(k,v)#738])
+- ConvertToSafe
   +- Sort [k#684 ASC], false, 0
  +- TungstenExchange hashpartitioning(k#684,200), None
 +- ConvertToUnsafe
+- SortBasedAggregate(key=[k#684],
functions=[(AggKV(k#684,v#685),mode=Partial,isDistinct=false)],
output=[k#684,value#731])
   +- ConvertToSafe
  +- Sort [k#684 ASC], false, 0
 +- Project [k#684,v#685,k#684]
+- Scan ExistingRDD[k#684,v#685]

at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregate.doExecute(SortBasedAggregate.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
at
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1414)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
at org.apache.spark.sql.Dataset.show(Dataset.scala:228)
at org.apache.spark.sql.Dataset.show(Dataset.scala:192)
at org.apache.spark.sql.Dataset.show(Dataset.scala:200)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:59)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:63)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:65)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:67)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:69)
at $iwC$$iwC$$iwC$$iwC.(:71)
at $iwC$$iwC$$iwC.(:73)
at $iwC$$iwC.(:75)
at $iwC.(:77)
at (:79)
at .(:83)
at .()
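
For the record, a rough sketch of the RDD route referred to above, which does collect
each key's values without a custom Aggregator (using the keyValsDs defined earlier; it
gives up the Dataset-level optimizations):

// Drop to the underlying RDD, group by key, and collect each group's values.
val byKey = keyValsDs.rdd
  .groupBy(_.k)
  .mapValues(_.toList)

byKey.collect().foreach(println)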
 

Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico,

Glad to hear that you love Spark and would like to organize a meetup in
Rome. We created a Meetup-in-a-box to help with that - check out the post
https://databricks.com/blog/2015/11/19/meetup-in-a-box.html.

HTH!
Denny



On Fri, Feb 19, 2016 at 02:38 Domenico Pontari 
wrote:

>
> Hi guys,
> I spent till September 2015 in the bay area working with Spark and I love
> it. Now I'm back to Rome and I'd like to organize a meetup about it and Big
> Data in general. Any idea / suggestions? Can you eventually sponsor beers
> and pizza for it?
> Best,
> Domenico
>


reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
What kind of steps exist when reading the ORC format in Spark SQL?
I mean that reading a CSV file is usually just reading the dataset directly into
memory.

But I feel like Spark SQL has some extra steps when reading the ORC format.
For example, does it have to create a table and then insert the dataset into that
table? Are these steps part of the read path in Spark SQL?
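
For what it's worth, a minimal sketch of reading ORC directly (the path is
illustrative, and a Hive-enabled build is assumed, since the ORC data source lives in
the Hive module). No table has to be created or populated first; the files are read
much like CSV or Parquet, and registering a temporary table is optional:

// In a Hive-enabled spark-shell the provided sqlContext is already a HiveContext.
val df = sqlContext.read.format("orc").load("hdfs:///warehouse/my_orc_dataset")  // illustrative path
df.printSchema()
df.registerTempTable("orc_table")   // only needed if you want to query it with SQL
sqlContext.sql("select count(*) from orc_table").show()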

[image: Inline image 1]


Spark Distribution of Small Dataset

2016-01-28 Thread Philip Lee
Hi,

A simple question about how Spark distributes a small dataset.

Let's say I have a cluster of 8 machines, each with 48 cores and 48 GB of RAM.
The dataset (in ORC format, written by Hive) is small, around 1 GB, but I copied it to
HDFS.

1) If Spark SQL runs over the dataset distributed on HDFS across the machines, what
happens to the job? Does one machine handle the whole dataset because it is
so small?

2) The thing is, the dataset is already distributed across the machines.
Does each machine handle its part of the distributed dataset and then send it to the
master node?

Could you explain in detail how this works in a distributed setting?

Best,
Phil


Re: a question about web ui log

2016-01-26 Thread Philip Lee
Yes, I tried it, but it simply does not work.

So my approach is to use an SSH tunnel to forward a port on the cluster to a
local port.

But in the Spark UI there are two ports which I should forward using the SSH
tunnel. With the default ports, 8080 is the web UI port used to get into the web UI,
and 4040 is the monitoring port used to see execution times, such as the DAG in the
"application details UI", as you probably know.

After a job finishes, I can see it in the list of jobs on the web UI on port
8080, but when I click "application details UI" on port 4040 to see execution
times, it does not work.

Any suggestions? I really need to see the execution DAG.
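
For reference, the event-log approach described in the reply quoted below would
amount to something like this in spark-defaults.conf (directories are illustrative);
with the events persisted, the history server can rebuild the application details UI
and only its single port needs tunnelling:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-event-logs
spark.history.fs.logDirectory    hdfs:///spark-event-logs

The history server started from sbin/start-history-server.sh listens on port 18080 by
default, so forwarding that one port should be enough to browse completed applications.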

Best,
Phil

On Tue, Jan 26, 2016 at 12:04 AM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> I am not sure whether you can copy the log files from Spark workers to
> your local machine and view it from the Web UI. In fact, if you are able to
> copy the log files locally, you can just view them directly in any text
> editor.
>
>
>
> I suspect what you really want to see is the application history. Here is
> the relevant information from Spark’s monitoring page (
> http://spark.apache.org/docs/latest/monitoring.html)
>
>
>
> To view the web UI after the fact, set spark.eventLog.enabled to true
> before starting the application. This configures Spark to log Spark events
> that encode the information displayed in the UI to persisted storage.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Philip Lee [mailto:philjj...@gmail.com]
> *Sent:* Monday, January 25, 2016 9:51 AM
> *To:* user@spark.apache.org
> *Subject:* Re: a question about web ui log
>
>
>
> As I mentioned before, I am tryint to see the spark log on a cluster via
> ssh-tunnel
>
>
>
> 1) The error on application details UI is probably from monitoring porting
> ​4044. Web UI port is 8088, right? so how could I see job web ui view and
> application details UI view in the web ui on my local machine?
>
>
>
> 2) still wondering how to see the log after copyting log file to my local.
>
>
>
> The error was metioned in previous mail.
>
>
>
> Thanks,
>
> Phil
>
>
>
>
>
>
>
> On Mon, Jan 25, 2016 at 5:36 PM, Philip Lee <philjj...@gmail.com> wrote:
>
> ​Hello, a questino about web UI log.
>
>
>
> ​I could see web interface log after forwarding the port on my cluster to
> my local and click completed application, but when I clicked "application
> detail UI"
>
>
>
> [image: Inline image 1]
>
>
>
> It happened to me. I do not know why. I also checked the specific log
> folder. It has a log file in it. Actually, that's why I could click the
> completed application link, right?
>
>
>
> So is it okay for me to copy the log file in my cluster to my local
> machine.
>
> And after turning on spark Job Manger on my local by myself, I could see
> application deatils UI in my local machine?
>
>
>
> Best,
>
> Phil
>
>
>


a question about web ui log

2016-01-25 Thread Philip Lee
Hello, a question about the web UI log.

I could see the web interface log after forwarding the port on my cluster to
my local machine and clicking a completed application, but this is what happened when
I clicked "application detail UI":

[image: Inline image 1]

I do not know why. I also checked the specific log
folder; it has a log file in it. Actually, that's why I could click the
completed application link, right?

So is it okay for me to copy the log file from my cluster to my local machine?
And after starting the Spark job manager locally myself, could I see the
application details UI on my local machine?

Best,
Phil


Re: a question about web ui log

2016-01-25 Thread Philip Lee
As I mentioned before, I am trying to see the Spark log on a cluster via an
SSH tunnel.

1) The error on the application details UI probably comes from the monitoring port,
4044. The web UI port is 8088, right? So how could I see both the job web UI view and
the application details UI view in the web UI on my local machine?

2) I am still wondering how to see the log after copying the log file to my local machine.

The error was mentioned in the previous mail.

Thanks,
Phil



On Mon, Jan 25, 2016 at 5:36 PM, Philip Lee <philjj...@gmail.com> wrote:

> ​Hello, a questino about web UI log.
>
> ​I could see web interface log after forwarding the port on my cluster to
> my local and click completed application, but when I clicked "application
> detail UI"
>
> [image: Inline image 1]
>
> It happened to me. I do not know why. I also checked the specific log
> folder. It has a log file in it. Actually, that's why I could click the
> completed application link, right?
>
> So is it okay for me to copy the log file in my cluster to my local
> machine.
> And after turning on spark Job Manger on my local by myself, I could see
> application deatils UI in my local machine?
>
> Best,
> Phil
>


Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html:

For Python, you can use the --py-files argument of spark-submit to add .py,
.zip or .egg files to be distributed with your application. If you depend
on multiple Python files we recommend packaging them into a .zip or .egg.
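
A minimal sketch of that, with illustrative file names (a package directory mypkg
zipped into deps.zip, and main.py as the entry point):

zip -r deps.zip mypkg/
spark-submit --master yarn --py-files deps.zip main.py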



On Fri, Jan 8, 2016 at 6:44 PM Ascot Moss  wrote:

> Hi,
>
> Instead of using Spark-shell, does anyone know how to build .zip (or .egg)
> for Python and use Spark-submit to run?
>
> Regards
>


Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join
the mailing list.


On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele 
wrote:

>
>


Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using
model = LinearRegressionWithSGD.train(parseddata, iterations=100,
step=0.01, intercept=True)

then to get the intercept, you would use
model.intercept

More information can be found at:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression

HTH!


On Tue, Dec 15, 2015 at 11:06 PM Arunkumar Pillai 
wrote:

>
> How to get intercept in  Linear Regression Model?
>
> LinearRegressionWithSGD.train(parsedData, numIterations)
>
> --
> Thanks and Regards
> Arun
>


Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark
(https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/)

On Mon, Nov 2, 2015 at 05:27 Stefano Baghino 
wrote:

> There is this interesting book from Databricks:
> https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
>
> What do you think? Does it contain the info you're looking for? :)
>
> On Mon, Nov 2, 2015 at 2:18 PM, satish chandra j  > wrote:
>
>> HI All,
>> Yes, any such doc will be a great help!!!
>>
>>
>>
>> On Fri, Oct 30, 2015 at 4:35 PM, huangzheng <1106944...@qq.com> wrote:
>>
>>> I have the same question. Can anyone help us?
>>>
>>>
>>> -- Original message --
>>> *From:* "Deepak Sharma";
>>> *Sent:* Friday, October 30, 2015, 7:23 PM
>>> *To:* "user";
>>> *Subject:* Best practises
>>>
>>> Hi
>>> I am looking for any blog / doc on the developer's best practices if
>>> using Spark .I have already looked at the tuning guide on
>>> spark.apache.org.
>>> Please do let me know if any one is aware of any such resource.
>>>
>>> Thanks
>>> Deepak
>>>
>>
>>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>


Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey
2015.  Because of the survey, we have a better picture of who’s using
Spark, how they’re using it, and what they’re using it to build–insights
that will guide major updates to the Spark platform as we move into Spark’s
next phase of growth. The results are summarized in an infographic
available here: Spark Survey Results 2015 are now available.
Thank you to everyone who participated in Spark Survey 2015 and for your
help in shaping Spark’s future!


Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen Lee
scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:121) 
at 
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:125) 
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1269) 
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1203) 
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1210) 
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:22) 
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:27) 
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29) 
at $iwC$$iwC$$iwC$$iwC$$iwC.(:31) 
at $iwC$$iwC$$iwC$$iwC.(:33) 
at $iwC$$iwC$$iwC.(:35) 
at $iwC$$iwC.(:37) 
at $iwC.(:39) 
at (:41) 
at .(:45) 
at .() 
at .(:7) 
at .() 
at $print() 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) 
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) 
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) 
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) 
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) 
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) 
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) 
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) 
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) 
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) 
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
 
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
 
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
 
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) 
at org.apache.spark.repl.Main$.main(Main.scala:31) 
at org.apache.spark.repl.Main.main(Main.scala) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
 
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) 
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) 
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.createSplit(OrcInputFormat.java:694)
 
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:822)
 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:745) 




--
ca...@korea.com
cazen@samsung.com
http://www.Cazen.co.kr

> On Sep 13, 2015, at 3:00 PM, Owen O'Malley <omal...@apache.org> wrote:
> 
> Do you have a stack trace of the array out of bounds exception? I don't 
> remember an array out of bounds problem off the top of my head. A stack trace 
> will tell me a lot, obviously.
> 
> If you are using Spark 1.4 that implies H

[Question] ORC - EMRFS Problem

2015-09-12 Thread Cazen Lee
Good day!

I think there are some problems between ORC and AWS EMRFS.

When I was trying to read ORC files larger than roughly 150 MB from S3, an
ArrayIndexOutOfBounds exception occurred.

I'm sure that it's an AWS-side issue, because there was no exception when reading
from HDFS or S3NativeFileSystem.

Parquet works normally, but that is inconvenient (almost all of our system is based on
ORC).

Does anybody know about this issue?

I've tried Spark 1.4.1 (EMR 4.0.0), and there is no 1.5 patch note about this.

Thank you



how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large JSON log files with Spark, but it
fails every time with `scala.MatchError`, whether I give it a schema or not.

I just want to skip the lines that do not match the schema, but I can't find
a way to do that in the Spark docs.

I know that writing my own JSON parser and mapping it over the raw file RDD
would get things done, but I want to use
`sqlContext.read.schema(schema).json(fileNames).selectExpr(...)` because
it's much easier to maintain.

thanks
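
One rough workaround, sketched below, is to pre-filter the raw text so only well-formed JSON lines reach the schema-based reader. The helper name is made up (not a Spark API), and this only drops lines that are not valid JSON at all; values that parse but clash with the supplied schema still need separate handling. Jackson is used only because it already ships on Spark's classpath.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.fasterxml.jackson.databind.ObjectMapper
import scala.util.Try

// Sketch only: drop lines that are not well-formed JSON before handing the
// rest to sqlContext.read.json (optionally with .schema(schema)).
def readJsonSkippingBadLines(sqlContext: SQLContext, fileNames: String): DataFrame = {
  val raw = sqlContext.sparkContext.textFile(fileNames)
  val wellFormed = raw.mapPartitions { lines =>
    val mapper = new ObjectMapper()                       // one parser per partition
    lines.filter(line => Try(mapper.readTree(line)).isSuccess)
  }
  sqlContext.read.json(wellFormed)                        // or .schema(schema).json(wellFormed)
}
```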


Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on the topic at
http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded

If you have an archival type strategy, you could do daily BCP extracts out
to load the data into HDFS / S3 / etc. This would result in minimal impact
to SQL Server for the extracts (for that scenario, that was of primary
importance).
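
For simply reading the table directly into Spark (as opposed to the BCP/archival route above), a rough sketch with the generic JDBC data source looks like the following; it assumes Spark 1.4+ syntax, the connection string, table name, and credentials are placeholders, and the Microsoft JDBC driver JAR has to be on the driver/executor classpath (e.g. via --jars).

```scala
// Sketch only: read a SQL Server table through the JDBC data source,
// then land it once as Parquet so later queries never touch SQL Server.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=mydb;user=spark;password=secret")
  .option("dbtable", "dbo.SalesOrders")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .load()

df.write.parquet("hdfs:///archive/sales_orders")
```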

On Thu, Jul 23, 2015 at 16:42 vinod kumar vinodsachin...@gmail.com wrote:

 Hi Everyone,

 I am in need of using a table from MS SQL Server in Spark. Could anyone please
 share an optimized way to do that?

 Thanks in advance,
 Vinod




RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew Or,
Yes, the NodeManager was restarted, and I also checked the logs to confirm that the JARs 
appear in the CLASSPATH.
I have also downloaded the binary distribution and used the JAR 
spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar, without success.
Has anyone successfully enabled spark_shuffle by following the documentation at 
https://spark.apache.org/docs/1.4.1/job-scheduling.html?
I'm testing it on Hadoop 2.4.1.
Any feedback or suggestions are appreciated, thanks.

Date: Fri, 17 Jul 2015 15:35:29 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi all,
Did you forget to restart the node managers after editing yarn-site.xml by any 
chance?
-Andrew
2015-07-17 8:32 GMT-07:00 Andrew Lee alee...@hotmail.com:



I have encountered the same problem after following the document.
Here's my spark-defaults.conf:
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the two JARs to the NodeManager's classpath 
/opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log, and 
the JARs appear in the classpath). I notice that the JAR location is not the 
same as what the 1.4 documentation describes. I found them under 
network/yarn/target and network/shuffle/target/ after building with 
-Phadoop-2.4 -Psparkr -Pyarn -Phive -Phive-thriftserver in Maven.

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar


and still getting the following exception.
Exception in thread "ContainerLauncher #0" java.lang.Error: org.apache.spark.SparkException: Exception while starting container container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I'm missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable dynamic resource allocation? You can refer to this page for 
how to configure the Spark shuffle service for YARN.
https://spark.apache.org/docs/1.4.0/job-scheduling.html 
On Tue, Jul 7, 2015 at 10:55 PM, roy rp...@njit.edu wrote:
we tried --master yarn-client with no different result.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.








-- 
Best Regards

Jeff Zhang
  

  

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew,
Thanks for the advice. I didn't see those log lines in the NodeManager, so apparently 
something was wrong with the yarn-site.xml configuration.
After digging in more, I realized it was a user error. I'm sharing this so that 
other people may avoid the same mistake I made.
When I reviewed the configurations, I noticed that there was another property 
setting yarn.nodemanager.aux-services in mapred-site.xml. It turns out that 
mapred-site.xml overrides the yarn.nodemanager.aux-services property in 
yarn-site.xml; because of this, the spark_shuffle service was never enabled.  :(  
err..

After deleting the redundant, invalid properties from mapred-site.xml, it started 
working. I see the following logs from the NodeManager:

2015-07-21 21:24:44,046 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing YARN shuffle service for Spark
2015-07-21 21:24:44,046 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service spark_shuffle, spark_shuffle
2015-07-21 21:24:44,264 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled.

Appreciate all the pointers on where to look. Thanks, problem solved.



Date: Tue, 21 Jul 2015 09:31:50 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi Andrew,
Based on your driver logs, it seems the issue is that the shuffle service is 
actually not running on the NodeManagers, but your application is trying to 
provide a spark_shuffle secret anyway. One way to verify whether the shuffle 
service is actually started is to look at the NodeManager logs for the 
following lines:
Initializing YARN shuffle service for Spark
Started YARN shuffle service for Spark on port X

These should be logged at the INFO level. Also, could you verify whether all 
the executors have this problem, or just a subset? If even one of the NMs 
doesn't have the shuffle service, you'll see the stack trace that you ran into. 
If the log statements above are missing, it would be good to confirm that the 
yarn-site.xml change is actually reflected on all NMs.

Let me know if you can get it working. I've run the shuffle service myself on 
the master branch (which will become Spark 1.5.0) recently following the 
instructions and have not encountered any problems.
-Andrew   

RE: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Lee
I have encountered the same problem after following the document.
Here's my spark-defaults.conf:
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the two JARs to the NodeManager's classpath 
/opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log, and 
the JARs appear in the classpath). I notice that the JAR location is not the 
same as what the 1.4 documentation describes. I found them under 
network/yarn/target and network/shuffle/target/ after building with 
-Phadoop-2.4 -Psparkr -Pyarn -Phive -Phive-thriftserver in Maven.

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar

and still getting the following exception.
Exception in thread "ContainerLauncher #0" java.lang.Error: org.apache.spark.SparkException: Exception while starting container container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.altiscale.com
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.altiscale.com
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I'm missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable dynamic resource allocation? You can refer to this page for 
how to configure the Spark shuffle service for YARN.
https://spark.apache.org/docs/1.4.0/job-scheduling.html 
On Tue, Jul 7, 2015 at 10:55 PM, roy rp...@njit.edu wrote:
we tried --master yarn-client with no different result.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.








-- 
Best Regards

Jeff Zhang
  

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
I went ahead and tested your file and the results from the tests can be
seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.

Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
the query ran without any issues.  I was able to recreate the issue with
{Java 7, default}.  I included the commands I used to start the spark-shell,
but basically I just used all defaults (no alteration to driver or executor
memory), with the only additional option being driver-class-path to connect
to the MySQL Hive metastore.  This is on an OS X MacBook Pro.

One thing I did notice is that your Java 7 is update 51 while mine is update
79.  Could you see whether updating to Java 7 update 79 perhaps allows you to
use the MaxPermSize setting?
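
If updating Java is not an option, another knob worth trying (a sketch, not a definitive fix) is to raise PermGen explicitly for a packaged application: the executor side can be set through SparkConf, while the driver's own PermGen still has to be set before its JVM starts (spark.driver.extraJavaOptions in spark-defaults.conf, or --driver-java-options on spark-submit).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only, assuming Java 7: executor JVMs pick this up because they are
// launched after the conf is set; the driver's PermGen must be configured
// outside the application as noted above.
val conf = new SparkConf()
  .setAppName("permgen-workaround")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")
val sc = new SparkContext(conf)
```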




On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov s...@swoop.com wrote:

  The file is at
 https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1

  The command was included in the gist

  SPARK_REPL_OPTS=-XX:MaxPermSize=256m
 spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
 com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g

  /Sim

 Simeon Simeonov, Founder & CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Monday, July 6, 2015 at 12:59 AM
 To: Simeon Simeonov s...@swoop.com
 Cc: Denny Lee denny.g@gmail.com, Andy Huang 
 andy.hu...@servian.com.au, user user@spark.apache.org

 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   I have never seen issue like this. Setting PermGen size to 256m should
 solve the problem. Can you send me your test file and the command used to
 launch the spark shell or your application?

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:

   Yin,

  With 512Mb PermGen, the process still hung and had to be kill -9ed.

  At 1 GB, the spark-shell & associated processes stopped hanging and
 started exiting with:

  scala println(dfCount.first.getLong(0))
 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
 called with curMem=0, maxMem=2223023063
 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 229.5 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184) called
 with curMem=235040, maxMem=2223023063
 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
 in memory on localhost:65464 (size: 19.7 KB, free: 2.1 GB)
 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from first
 at console:30
 java.lang.OutOfMemoryError: PermGen space
 Stopping spark context.
 Exception in thread main
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread main
 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
 broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
 GB)

  That did not change up until 4 GB of PermGen space and 8 GB each for driver &
 executor.

  I stopped at this point because the exercise started looking silly. It
 is clear that 1.4.0 is using memory in a substantially different manner.

  I'd be happy to share the test file so you can reproduce this in your
 own environment.

  /Sim

  Simeon Simeonov, Founder & CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Sunday, July 5, 2015 at 11:04 PM
 To: Denny Lee denny.g@gmail.com
 Cc: Andy Huang andy.hu...@servian.com.au, Simeon Simeonov 
 s...@swoop.com, user user@spark.apache.org
 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   Sim,

  Can you increase the PermGen size? Please let me know what is your
 setting when the problem disappears.

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:

  I had run into the same problem where everything was working
 swimmingly with Spark 1.3.1.  When I switched to Spark 1.4, either
 upgrading to Java 8 (from Java 7) or bumping up the PermGen size
 solved my issue.  HTH!



  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
 wrote:

 We have hit the same issue in spark shell when registering a temp
 table. We observed it happening with those who had JDK 6. The problem went
 away after installing jdk 8. This was only for the tutorial materials which
 was about loading a parquet file.

  Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
  Sent from

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context
is primarily about very fast queries.  If you want to use real-time
queries, I would utilize Spark Streaming.  A couple of great resources on
this topic include Guest Lecture on Spark Streaming in Stanford CME 323:
Distributed Algorithms and Optimization
http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford
and Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
(from the recent Spark Summit 2015)

HTH!
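
As a rough illustration of that pattern (the source, batch interval, and table name below are made up), a Spark Streaming job can keep a HiveContext around and run SQL over each micro-batch:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("near-real-time-sql")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(10))        // 10s micro-batches (illustrative)
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Any DStream source works here; a socket stream keeps the sketch small.
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  val df = rdd.map(Tuple1(_)).toDF("event")            // one string column per line
  df.registerTempTable("events_batch")                 // visible to SQL for this batch
  hiveContext.sql("SELECT count(*) FROM events_batch").show()
}

ssc.start()
ssc.awaitTermination()
```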


On Mon, Jul 6, 2015 at 3:23 PM spierki florian.spierc...@crisalid.com
wrote:

 Hello,

 I'm currently asking myself about the performance of using Spark SQL with Hive
 to do real-time analytics.
 I know that Hive was created for batch processing, and that Spark is used to
 do fast queries.

 But will using Spark SQL with Hive allow me to do real-time queries? Or will it
 just make the queries faster, but not real time?
 Should I use another data store, such as HBase?

 Thanks in advance for your time and consideration,
 Florian



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean,
Sure, will take care of this.
HTH,
Denny

On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote:

 Here's our home page: http://www.meetup.com/Chicago-Spark-Users/

 Thanks,
 Dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem where everything was working swimmingly
with Spark 1.3.1.  When I switched to Spark 1.4, either upgrading to
Java 8 (from Java 7) or bumping up the PermGen size solved my issue.
HTH!



On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au wrote:

 We have hit the same issue in spark-shell when registering a temp table.
 We observed it happening with those who had JDK 6. The problem went away
 after installing JDK 8. This was only for the tutorial materials, which were
 about loading a parquet file.

 Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





 --
 Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
 f: 02 9376 0730| m: 0433221979



RE: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Andrew Lee
Hi Roberto,
I'm not an EMR person, but it looks like the -h option is deploying the necessary 
datanucleus JARs for you. The requirements for HiveContext are hive-site.xml and 
the datanucleus JARs. As long as these two are there, and Spark is compiled with 
-Phive, it should work.
spark-shell runs in yarn-client mode. Not sure whether your other application 
is running under the same mode or a different one. Try specifying yarn-client 
mode and see if you get the same result as spark-shell.
From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN on an AWS EMR 
cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, with 
Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, etc.). 
I make use of the AWS emr-bootstrap-action install-spark 
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the 
option/version -v1.3.1e so as to get the latest Spark for EMR installed and 
available.
I also have a simple Spark Streaming driver in my project. The driver is part 
of a larger Maven project; in the pom.xml I'm currently using:
[...]
<scala.binary.version>2.10</scala.binary.version>
<scala.version>2.10.4</scala.version>
<java.version>1.7</java.version>
<spark.version>1.3.1</spark.version>
<hadoop.version>2.4.1</hadoop.version>
[...]
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>

In fact, at compile and build time everything works just fine if, in my driver, 
I have:
-
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

// <some input reading actions>
// <some input transformation actions>

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

sqlContext.sql("<an-HiveQL-query>")

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

--- 
What happens is that, right after having been launched, the driver fails with the 
following exception:
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at <myDriver.scala: the line of the sqlContext.sql(query) call>
Caused by: <some stuff>
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Suspecting a wrong Hive installation/configuration or a libs/classpath problem, 
I SSHed into the cluster and launched a spark-shell. Leaving out the app 
configuration and the StreamingContext usage/definition, I then carried out all 
the actions listed in the driver implementation, in particular all the 
Hive-related ones, and they all went through smoothly!

I also tried to use the optional -h argument 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
 in the 

Re: SparkContext Threading

2015-06-06 Thread Lee McFadden
Hi Will,

That doesn't seem to be the case and was part of the source of my
confusion. The code currently in the run method of the runnable works
perfectly fine with the lambda expressions when it is invoked from the main
method. They also work when they are invoked from within a separate method
on the Transforms object.

It was only when putting that same code into another thread that the
serialization exception occurred.

Examples throughout the spark docs also use lambda expressions a lot -
surely those examples also would not work if this is always an issue with
lambdas?

On Sat, Jun 6, 2015, 12:21 AM Will Briggs wrbri...@gmail.com wrote:

 Hi Lee, it's actually not related to threading at all - you would still
 have the same problem even if you were using a single thread. See this
 section (
 https://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark)
 of the Spark docs.


 On June 5, 2015, at 5:12 PM, Lee McFadden splee...@gmail.com wrote:


 On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote:

 Your lambda expressions on the RDDs in the SecondRollup class are closing
 around the context, and Spark has special logic to ensure that all
 variables in a closure used on an RDD are Serializable - I hate linking to
 Quora, but there's a good explanation here:
 http://www.quora.com/What-does-Closure-cleaner-func-mean-in-Spark


 Ah, I see!  So if I broke out the lambda expressions into a method on an
 object it would prevent this issue.  Essentially, don't use lambda
 expressions when using threads.
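
For reference, a minimal sketch of that refactor (class and path names are illustrative), following the pattern from the "Passing Functions to Spark" section linked above:

```scala
import org.apache.spark.SparkContext

// The transformation lives in a standalone object, so the function passed to
// map() does not close over the enclosing Runnable or its SparkContext.
object Transforms {
  def lineLength(line: String): Int = line.length
}

class SecondRollup(@transient sc: SparkContext) extends Runnable with Serializable {
  override def run(): Unit = {
    val total = sc.textFile("/data/events")
      .map(Transforms.lineLength)   // method from a singleton object, no outer capture
      .reduce(_ + _)
    println(s"total characters: $total")
  }
}
```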

 Thanks again, I appreciate the help.



SparkContext Threading

2015-06-05 Thread Lee McFadden
Hi all,

I'm having some issues finding any kind of best practices when attempting
to create Spark applications which launch jobs from a thread pool.

Initially I had issues passing the SparkContext to other threads as it is
not serializable.  Eventually I found that adding the @transient annotation
prevents a NotSerializableException.

```
class SecondRollup(@transient sc: SparkContext, connector:
CassandraConnector, scanPoint: DateTime) extends Runnable with Serializable
{
...
}
```

However, now I am running into a different exception:

```
15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NoSuchMethodError: org.apache.spark.executor.TaskMetrics.inputMetrics_$eq(Lscala/Option;)V
at com.datastax.spark.connector.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:61)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:196)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The documentation (https://spark.apache.org/docs/latest/job-scheduling.html)
explicitly states that jobs can be submitted by multiple threads but I seem
to be doing *something* incorrectly and haven't found any docs to point me
in the right direction.

Does anyone have any advice on how to get jobs submitted by multiple
threads?  The jobs are fairly simple and work when I run them serially, so
I'm not exactly sure what I'm doing wrong.

Thanks,

Lee


Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
You can see an example of the constructor for the class which executes a
job in my opening post.

I'm attempting to instantiate and run the class using the code below:

```
val conf = new SparkConf()
  .setAppName(appNameBase.format(Test))

val connector = CassandraConnector(conf)

val sc = new SparkContext(conf)

// Set up the threadpool for running Jobs.
val pool = Executors.newFixedThreadPool(10)

pool.execute(new SecondRollup(sc, connector, start))
```

There is some surrounding code that then waits for all the jobs entered
into the thread pool to complete, although it's not really required at the
moment as I am only submitting one job until I get this issue straightened
out :)
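
(For completeness, a minimal sketch of that surrounding wait, assuming the `pool` and `sc` from the snippet above; the timeout value is arbitrary:)

```scala
// Block until every submitted job's thread finishes, then stop Spark.
pool.shutdown()
pool.awaitTermination(1, java.util.concurrent.TimeUnit.HOURS)
sc.stop()
```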

Thanks,

Lee

On Fri, Jun 5, 2015 at 11:50 AM Marcelo Vanzin van...@cloudera.com wrote:

 On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden splee...@gmail.com wrote:

 Initially I had issues passing the SparkContext to other threads as it is
 not serializable.  Eventually I found that adding the @transient annotation
 prevents a NotSerializableException.


 This is really puzzling. How are you passing the context around that you
 need to do serialization?

 Threads run all in the same process so serialization should not be needed
 at all.

 --
 Marcelo


