Hi,
I have a table in Hive (the data is stored as Avro files).
Using the Python Spark shell, I am trying to join two datasets:
events = spark.sql('select * from mydb.events')
intersect = events.where('attr2 in (5,6,7) and attr1 in (1,2,3)')
intersect.count()
But I am constantly receiving the following
Hi Charles,
I have created a very simplified job, https://github.com/ponkin/KafkaSnapshot, to
illustrate the problem:
https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala
In short, maybe the persist method is working, but not like I expected.
I thought
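For context, a minimal sketch of the shape such a job can take; this is not the code from the linked repo, and the topic name, broker address, batch interval, window size, and storage level are all assumptions, using the Spark 1.x direct-stream API:

import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaSnapshot"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, Set("topicA")) // assumed topic
// persist the windowed stream so each slide does not recompute the whole window
val windowed = stream.window(Seconds(60), Seconds(10))
windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)
windowed.count().print()
ssc.start()
ssc.awaitTermination()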
Great, thank you.
Sorry for being so inattentive :) I need to read the docs more carefully.
24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use
>
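The quoted advice is cut off here; it presumably refers to StreamingContext.getOrCreate, the standard way to restore a streaming job from its checkpoint. A minimal sketch, with the checkpoint path and batch interval as assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoints" // assumed path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("app"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // build DStreams here; this setup only runs when no checkpoint exists yet
  ssc
}

// restores the context and its DStream lineage from the checkpoint if present
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)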
By call?
>
> On Tue, Sep 8, 2015 at 5:33 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>> Oh my, I implemented one directStream instead of a union of three, but it is
>> still growing exponentially with the window method.
>>
et(topicA, topicB, topicB) ?
> You should be able to use a variable for that - read it from a config file,
> whatever.
>
> If you're talking about the match statement, yeah you'd need to hardcode your
> cases.
>
> On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>> Ok, I got it!
>> But it seems that I need to hardcode the topic name.
>>
>> Something like this?
>>
>> val source = KafkaUtils.createDirectStream[Array[Byte], Arr
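The call above is truncated in the archive. For illustration, a sketch of a complete Spark 1.x call with the topic set read from a config file rather than hardcoded, in line with the reply above; the "kafka.topics" property name, Typesafe Config, broker address, and batch interval are assumptions:

import com.typesafe.config.ConfigFactory
import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("app"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
// e.g. kafka.topics = "topicA,topicB,topicC" in application.conf
val topics = ConfigFactory.load().getString("kafka.topics").split(",").toSet
val source = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, topics)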
The thing is that these topics contain completely different Avro
objects (Array[Byte]) that I need to deserialize into different Java (Scala)
objects, filter, and then map to tuples of (String, String). So I have three streams
with different Avro objects in them. I need to cast them (using some business
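A minimal sketch of that three-stream shape; the case classes and decode functions below are stand-ins for the real Avro-generated types and deserializers, and ssc and kafkaParams are as in the sketches above:

import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// stand-ins for the real Avro types and their deserializers
case class A(key: String, value: String); def decodeA(b: Array[Byte]): A = ???
case class B(key: String, value: String); def decodeB(b: Array[Byte]): B = ???
case class C(key: String, value: String); def decodeC(b: Array[Byte]): C = ???

def raw(topic: String): DStream[Array[Byte]] =
  KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    ssc, kafkaParams, Set(topic)).map(_._2)

// each topic gets its own decoder and filter, then a common (String, String) shape
val a = raw("topicA").map(decodeA).filter(_.value.nonEmpty).map(x => (x.key, x.value))
val b = raw("topicB").map(decodeB).filter(_.value.nonEmpty).map(x => (x.key, x.value))
val c = raw("topicC").map(decodeC).filter(_.value.nonEmpty).map(x => (x.key, x.value))
val merged = a.union(b).union(c)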
OK, I got it.
When I use the 'yarn logs -applicationId <applicationId>' command, everything appears in
the right place.
Thank you!
07.09.2015, 01:44, "Gerard Maas" :
> You need to take into consideration 'where'
Hi Cody,
Thank you for the quick response.
The problem was that my application did not have enough resources (all executors
were busy), so Spark decided to run these tasks sequentially. When I added more
executors to the application, everything went fine.
Thank you anyway.
P.S. BTW, thank you for the great
Yes, I forgot to mention:
I chose a prime number as the modulo for the hash function because my keys are usually
strings, and Spark computes a key's partition from the key's hash (see
HashPartitioner.scala). So, to avoid a large number of collisions (many keys
landing in only a few partitions), it is common to use
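The sentence is cut off, but the idea is visible in HashPartitioner.getPartition, which is essentially nonNegativeMod(key.hashCode, numPartitions). A small sketch of picking a prime partition count; 1009 here is an arbitrary assumption, and sc is a live SparkContext as in the shell:

import org.apache.spark.HashPartitioner

// a prime number of partitions makes patterned string hash codes
// less likely to pile up in the same remainder classes
val pairs = sc.parallelize(Seq("user:1" -> 1, "user:2" -> 2, "user:3" -> 3))
val partitioned = pairs.partitionBy(new HashPartitioner(1009)) // 1009 is prime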