NullPointerException while joining two avro Hive tables

2017-02-04 Thread Понькин Алексей
Hi, I have a table in Hive (data is stored as Avro files). Using the Python Spark shell I am trying to join two datasets: events = spark.sql('select * from mydb.events'); intersect = events.where('attr2 in (5,6,7) and attr1 in (1,2,3)'); intersect.count(). But I am constantly receiving the following
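As an aside to the snippet above, the IN predicate itself can be modeled outside Spark. This is a minimal plain-Python sketch, not the Spark code path: the rows and the attr1/attr2 values are made up, and it only illustrates that under SQL semantics a NULL attribute never satisfies IN (such rows are filtered out rather than matched), which is worth checking when an Avro-backed table contains nullable columns.

```python
# Toy model of the where('attr2 in (5,6,7) and attr1 in (1,2,3)') filter.
# Hypothetical rows; None stands in for SQL NULL.
events = [
    {"attr1": 1, "attr2": 5},
    {"attr1": 2, "attr2": 9},     # attr2 not in (5,6,7) -> dropped
    {"attr1": None, "attr2": 6},  # NULL attr1 -> dropped, never matched
    {"attr1": 3, "attr2": 7},
]

def keep(row):
    a1, a2 = row["attr1"], row["attr2"]
    # SQL IN with a NULL operand is not true, so NULL rows are filtered out.
    return a1 is not None and a2 is not None \
        and a2 in (5, 6, 7) and a1 in (1, 2, 3)

intersect = [r for r in events if keep(r)]
print(len(intersect))  # analogous to intersect.count() -> 2
```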

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-12 Thread Понькин Алексей
Hi Charles, I have created a very simplified job - https://github.com/ponkin/KafkaSnapshot - to illustrate the problem. https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala In short - maybe the persist method is working, but not like I expected. I thought
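The likely source of the surprise is that cache()/persist() is lazy. The following is a toy model in plain Python (not Spark's implementation): persist() only sets a flag, nothing is computed or stored until the first action, and only subsequent actions are served from the cache.

```python
# Toy model of RDD.persist() laziness. Class and field names are invented.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._cached = None
        self._persist = False
        self.compute_calls = 0    # how many times we actually computed

    def persist(self):
        self._persist = True      # just a flag -- no work happens here
        return self

    def collect(self):            # an "action"
        if self._persist and self._cached is not None:
            return self._cached   # served from cache
        self.compute_calls += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

ds = LazyDataset(lambda: [x * x for x in range(5)]).persist()
ds.collect()              # first action: computes and fills the cache
result = ds.collect()     # second action: cache hit, no recompute
print(ds.compute_calls)   # 1
```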

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-11-24 Thread Понькин Алексей
Great, thank you. Sorry for being so inattentive) Need to read the docs carefully. -- Yandex.Mail - reliable mail http://mail.yandex.ru/neo2/collect/?exp=1=1 24.11.2015, 15:15, "Deng Ching-Mallete" : > Hi, > > If you wish to read from checkpoints, you need to use >
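The reply above points at the getOrCreate pattern (in Spark Streaming, StreamingContext.getOrCreate(checkpointDir, createFn)). Here is a hedged plain-Python stand-in for that pattern, not the Spark API: restore state from the checkpoint directory if one exists, otherwise call the factory function and checkpoint the fresh state. The state.json file name and the offset field are invented for the sketch.

```python
import json
import os
import tempfile

# Toy version of the getOrCreate pattern: recover from a checkpoint
# directory if present, otherwise build fresh state via create_fn.
def get_or_create(checkpoint_dir, create_fn):
    state_file = os.path.join(checkpoint_dir, "state.json")
    if os.path.exists(state_file):
        with open(state_file) as f:
            return json.load(f)       # recovered: create_fn is NOT called
    state = create_fn()               # first run: build fresh state
    with open(state_file, "w") as f:
        json.dump(state, f)           # checkpoint it for next time
    return state

ckpt = tempfile.mkdtemp()
first = get_or_create(ckpt, lambda: {"offset": 0})
first["offset"] = 42                  # simulate progress being checkpointed
with open(os.path.join(ckpt, "state.json"), "w") as f:
    json.dump(first, f)
second = get_or_create(ckpt, lambda: {"offset": 0})
print(second["offset"])               # 42 -- restored from the checkpoint
```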

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Понькин Алексей
By call? > > On Tue, Sep 8, 2015 at 5:33 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote: >> Oh my, I implemented one directStream instead of a union of three, but it is >> still growing exponentially with the window method. >>

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
et(topicA, topicB, topicB)? > You should be able to use a variable for that - read it from a config file, whatever. > > If you're talking about the match statement, yeah you'd need to hardcode your cases. > > On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.r

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
ases. > > On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote: >> Ok. I got it! >> But it seems that I need to hard code topic name. >> >> something like that? >> >> val source = KafkaUtils.createDirectStream[Array[Byte], Arr

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
The thing is that these topics contain absolutely different Avro objects (Array[Byte]) that I need to deserialize into different Java (Scala) objects, filter, and then map to a tuple (String, String). So I have 3 streams with different Avro objects in them. I need to cast them (using some business
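The per-topic dispatch described above can be sketched in plain Python. This is a toy model, not the Kafka direct-stream API: the stream is a list of (topic, payload) records, the three decoders are hypothetical stand-ins for the real Avro deserializers, and everything funnels into one common (String, String) shape.

```python
# Hypothetical per-topic decoders standing in for Avro deserialization.
def decode_a(payload): return {"id": payload.decode(), "kind": "A"}
def decode_b(payload): return {"id": payload.decode(), "kind": "B"}
def decode_c(payload): return {"id": payload.decode(), "kind": "C"}

DECODERS = {"topicA": decode_a, "topicB": decode_b, "topicC": decode_c}

def to_pair(topic, payload):
    obj = DECODERS[topic](payload)   # topic-specific deserialization
    return (obj["id"], obj["kind"])  # common (String, String) tuple

# One merged stream of (topic, raw bytes), as a single direct stream
# over several topics would deliver.
stream = [("topicA", b"1"), ("topicC", b"2"), ("topicB", b"3")]
pairs = [to_pair(t, p) for t, p in stream]
print(pairs)  # [('1', 'A'), ('2', 'C'), ('3', 'B')]
```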

Re: [streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Понькин Алексей
OK, I got it. When I use the 'yarn logs -applicationId ' command, everything appears in the right place. Thank you! -- Yandex.Mail - reliable mail http://mail.yandex.ru/neo2/collect/?exp=1=1 07.09.2015, 01:44, "Gerard Maas" : > You need to take into consideration 'where'

Re: [spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-05 Thread Понькин Алексей
Hi Cody, Thank you for the quick response. The problem was that my application did not have enough resources (all executors were busy), so Spark decided to run these tasks sequentially. When I added more executors for the application, everything went fine. Thank you anyway. P.S. BTW thanks for the great

Re: Spark Number of Partitions Recommendations

2015-08-02 Thread Понькин Алексей
Yes, I forgot to mention that I chose a prime number as the modulo for the hash function, because my keys are usually strings and Spark calculates the particular partition using the key's hash (see HashPartitioner.scala). So, to avoid a big number of collisions (when many keys are located in a few partitions) it is common to use
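The partitioning logic being referenced can be sketched as follows. This is a plain-Python model, not Spark's code: HashPartitioner assigns partition = nonNegativeMod(key.hashCode, numPartitions), and the Java-style string hashCode below is a deterministic stand-in for JVM hashing (the example keys are made up). The classic "Aa"/"BB" collision shows why keys that hash alike pile into one partition.

```python
# Deterministic stand-in for Java's String.hashCode (31-polynomial).
def java_string_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, as the JVM would
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_partitions):
    # Python's % already yields a non-negative result for a positive
    # modulus, which is what Spark's nonNegativeMod guarantees.
    return java_string_hash(key) % num_partitions

# "Aa" and "BB" are a well-known hashCode collision (both 2112): with
# any partition count they land in the same partition.
print(partition("Aa", 7), partition("BB", 7))

# Distribution of 1000 made-up keys over a composite vs a prime count.
keys = [f"user-{i}" for i in range(1000)]
for n in (8, 7):
    counts = [0] * n
    for k in keys:
        counts[partition(k, n)] += 1
    print(n, counts)
```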