Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice articles about Parquet *with* Avro:

   - https://dzone.com/articles/understanding-how-parquet
   - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Nice video from the good folks of Cloudera on the *differences* between
"Avro" and Parquet:

   - https://www.youtube.com/watch?v=AY1dEfyFeHc
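
As a quick illustration of the column-pruning point raised below (a minimal
sketch, assuming a Spark 1.x spark-shell session where sc and sqlContext
already exist, with a hypothetical schema and output path):

// Hypothetical schema and path, only to illustrate the columnar layout
case class Event(id: Long, user: String, country: String, amount: Double)

val events = sc.parallelize(Seq(
  Event(1L, "alice", "FR", 10.0),
  Event(2L, "bob", "US", 5.0)))

// Write as Parquet; the schema travels with the files
sqlContext.createDataFrame(events).write.parquet("/tmp/events.parquet")

// Selecting only two columns lets Parquet skip the other columns on disk
sqlContext.read.parquet("/tmp/events.parquet")
  .select("country", "amount")
  .groupBy("country")
  .sum("amount")
  .show()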


2016-03-04 7:12 GMT+01:00 Koert Kuipers <ko...@tresata.com>:

> well can you use orc without bringing in the kitchen sink of dependencies
> also known as hive?
>
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:
>
>> How about ORC? I have experimented briefly with Parquet and ORC, and I
>> liked the fact that ORC has its schema within the file, which makes it
>> handy to work with any other tools.
>>
>> Jong Wook
>>
>> On 3 March 2016 at 23:29, Don Drake <dondr...@gmail.com> wrote:
>>
>>> My tests show Parquet has better performance than Avro in just about
>>> every test.  It really shines when you are querying a subset of columns in
>>> a wide table.
>>>
>>> -Don
>>>
>>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann <tim.sp...@airisdata.com>
>>> wrote:
>>>
>>>> Which format is the best format for Spark SQL ad hoc queries and general
>>>> data storage?
>>>>
>>>> There are lots of specialized cases, but generally accessing some but
>>>> not all the available columns with a reasonable subset of the data.
>>>>
>>>> I am leaning towards Parquet as it has great support in Spark.
>>>>
>>>> I also have to consider any file on HDFS may be accessed from other
>>>> tools like Hive, Impala, HAWQ.
>>>>
>>>> Suggestions?
>>>> —
>>>> airis.DATA
>>>> Timothy Spann, Senior Solutions Architect
>>>> C: 609-250-5894
>>>> http://airisdata.com/
>>>> http://meetup.com/nj-datascience
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake
>>> 800-733-2143
>>>
>>
>>
>


-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/


Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-23 Thread Paul Leclercq
I successfully processed my data by manually resetting my topic offsets in
ZK.

In case it helps someone, here are my steps:

Make sure you stop all your consumers before doing that, otherwise they
will overwrite the new offsets you write.

set /consumers/{yourConsumerGroup}/offsets/{yourFancyTopic}/{partitionId}
{newOffset}


Source : https://metabroadcast.com/blog/resetting-kafka-offsets
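
As an alternative to editing ZooKeeper by hand, here is a minimal sketch of
starting a direct stream from explicit offsets (assuming the Spark 1.x direct
API from spark-streaming-kafka; the broker list, topic name and partition
count are hypothetical):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("offset-reset-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical broker list
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// Start the (hypothetical) new topic from offset 0 on each of its 2 partitions
val fromOffsets = Map(
  TopicAndPartition("myNewTopic", 0) -> 0L,
  TopicAndPartition("myNewTopic", 1) -> 0L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

The same fromOffsets map can be filled from wherever you persist your offsets,
so a redeploy does not have to reread whole topics.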

2016-02-22 11:55 GMT+01:00 Paul Leclercq <paul.lecle...@tabmo.io>:

> Thanks for your quick answer.
>
> If I set "auto.offset.reset" to "smallest" as for KafkaParams like this
>
> val kafkaParams = Map[String, String](
>  "metadata.broker.list" -> brokers,
>  "group.id" -> groupId,
>  "auto.offset.reset" -> "smallest"
> )
>
> And then use :
>
> val streams = KafkaUtils.createStream(ssc, kafkaParams, kafkaTopics, 
> StorageLevel.MEMORY_AND_DISK_SER_2)
>
> My fear is that, every time I deploy a new version, all the consumer's topics
> are going to be read from the beginning, but as said in Kafka's documentation:
>
> auto.offset.reset default : largest
>
> What to do when there* is no initial offset in ZooKeeper* or if an offset is 
> out of range:
> * smallest : automatically reset the offset to the smallest offset
>
> So I will go for this option the next time I need to process a new topic.
>
> To fix my problem, as the topic has already been processed and registered in
> ZK, I will use a directStream from smallest, remove all DB inserts for this
> topic, and restart a "normal" stream once the lag is caught up.
>
>
> 2016-02-22 10:57 GMT+01:00 Saisai Shao <sai.sai.s...@gmail.com>:
>
>> You could set this configuration "auto.offset.reset" through parameter
>> "kafkaParams" which is provided in some other overloaded APIs of
>> createStream.
>>
>> By default Kafka will pick data from the latest offset unless you explicitly
>> set it; this is the behavior of Kafka, not Spark.
>>
>> Thanks
>> Saisai
>>
>> On Mon, Feb 22, 2016 at 5:52 PM, Paul Leclercq <paul.lecle...@tabmo.io>
>> wrote:
>>
>>> Hi,
>>>
>>> Do you know why, with the receiver approach
>>> <http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-1-receiver-based-approach>
>>> and a *consumer group*, a new topic is not read from the beginning but
>>> from the latest?
>>>
>>> Code example :
>>>
>>>  val kafkaStream = KafkaUtils.createStream(streamingContext,
>>>  [ZK quorum], [consumer group id], [per-topic number of Kafka 
>>> partitions to consume])
>>>
>>>
>>> Is there a way to tell it, *only for new topics*, to read from the beginning?
>>>
>>> From Confluence FAQ
>>>
>>>> Alternatively, you can configure the consumer by setting
>>>> auto.offset.reset to "earliest" for the new consumer in 0.9 and "smallest"
>>>> for the old consumer.
>>>
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whydoesmyconsumernevergetanydata?
>>>
>>> Thanks
>>> --
>>>
>>> Paul Leclercq
>>>
>>
>>
>
>
> --
>
> Paul Leclercq | Data engineer
>
>
>  paul.lecle...@tabmo.io  |  http://www.tabmo.fr/
>



-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/


Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
Thanks for your quick answer.

If I set "auto.offset.reset" to "smallest" as for KafkaParams like this

val kafkaParams = Map[String, String](
 "metadata.broker.list" -> brokers,
 "group.id" -> groupId,
 "auto.offset.reset" -> "smallest"
)

And then use :

val streams = KafkaUtils.createStream(ssc, kafkaParams, kafkaTopics,
StorageLevel.MEMORY_AND_DISK_SER_2)

My fear is that, every time I deploy a new version, all the consumer's
topics are going to be read from the beginning, but as said in Kafka's
documentation:

auto.offset.reset default : largest

What to do when there* is no initial offset in ZooKeeper* or if an
offset is out of range:
* smallest : automatically reset the offset to the smallest offset

So I will go for this option the next time I need to process a new topic.

To fix my problem, as the topic has already been processed and
registered in ZK, I will use a directStream from smallest, remove
all DB inserts for this topic, and restart a "normal" stream once the
lag is caught up.


2016-02-22 10:57 GMT+01:00 Saisai Shao <sai.sai.s...@gmail.com>:

> You could set this configuration "auto.offset.reset" through parameter
> "kafkaParams" which is provided in some other overloaded APIs of
> createStream.
>
> By default Kafka will pick data from the latest offset unless you explicitly
> set it; this is the behavior of Kafka, not Spark.
>
> Thanks
> Saisai
>
> On Mon, Feb 22, 2016 at 5:52 PM, Paul Leclercq <paul.lecle...@tabmo.io>
> wrote:
>
>> Hi,
>>
>> Do you know why, with the receiver approach
>> <http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-1-receiver-based-approach>
>> and a *consumer group*, a new topic is not read from the beginning but
>> from the latest?
>>
>> Code example :
>>
>>  val kafkaStream = KafkaUtils.createStream(streamingContext,
>>  [ZK quorum], [consumer group id], [per-topic number of Kafka partitions 
>> to consume])
>>
>>
>> Is there a way to tell it, *only for new topics*, to read from the beginning?
>>
>> From Confluence FAQ
>>
>>> Alternatively, you can configure the consumer by setting
>>> auto.offset.reset to "earliest" for the new consumer in 0.9 and "smallest"
>>> for the old consumer.
>>
>>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whydoesmyconsumernevergetanydata?
>>
>> Thanks
>> --
>>
>> Paul Leclercq
>>
>
>


-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/


Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
Hi,

Do you know why, with the receiver approach
<http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-1-receiver-based-approach>
and a *consumer group*, a new topic is not read from the beginning but from
the latest?

Code example :

 val kafkaStream = KafkaUtils.createStream(streamingContext,
 [ZK quorum], [consumer group id], [per-topic number of Kafka
partitions to consume])


Is there a way to tell it, *only for new topics*, to read from the beginning?

From the Confluence FAQ:

> Alternatively, you can configure the consumer by setting auto.offset.reset
> to "earliest" for the new consumer in 0.9 and "smallest" for the old
> consumer.


https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whydoesmyconsumernevergetanydata?

Thanks
-- 

Paul Leclercq


Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-20 Thread Paul Leclercq
Hi Raghvendra and Spark users,

I also have trouble activating my standby master when my first master is
shut down (via ./sbin/stop-master.sh or via an instance shutdown), and I just
want to share my thoughts with you.

To answer your question Raghvendra: in *spark-env.sh*, if 2 IPs are set
for SPARK_MASTER_IP (SPARK_MASTER_IP='W.X.Y.Z,A.B.C.D'), the standalone
cluster cannot be launched.

So I use only one IP there, as the SparkContext can learn about the other
masters another way; as written in the Standalone ZooKeeper HA doc, "you
might start your SparkContext pointing to spark://host1:port1,host2:port2".
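
For example, a minimal sketch (hypothetical hostnames) of an application
pointing at both masters, so the driver registers with whichever one ZooKeeper
has elected leader and fails over if leadership moves:

import org.apache.spark.{SparkConf, SparkContext}

// List every master in the URL; only the current leader accepts the registration
val conf = new SparkConf()
  .setAppName("ha-example")
  .setMaster("spark://master1:7077,master2:7077")
val sc = new SparkContext(conf)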

In my opinion, we should not have to set SPARK_MASTER_IP, as this is
stored in ZooKeeper:

you can launch multiple Masters in your cluster connected to the same
> ZooKeeper instance. One will be elected “leader” and the others will remain
> in standby mode.

When starting up, an application or Worker needs to be able to find and
> register with the current lead Master. Once it successfully registers,
> though, it is “in the system” (i.e., stored in ZooKeeper).

 -
http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper

As I understand it, after a ./sbin/start-master.sh on both masters, one master
will be elected leader, and the other will be on standby.
To launch the workers, we can use ./sbin/start-slave.sh
spark://MASTER_ELECTED_IP:7077
I don't think we can use ./sbin/start-all.sh, which uses the slaves file
to launch workers and masters, as we cannot set 2 master IPs inside
spark-env.sh.

My SPARK_DAEMON_JAVA_OPTS content :

SPARK_DAEMON_JAVA_OPTS='-Dspark.deploy.recoveryMode="ZOOKEEPER"
> -Dspark.deploy.zookeeper.url="ZOOKEEPER_IP:2181"
> -Dspark.deploy.zookeeper.dir="/spark"'


A good thing to check, to see if everything went OK, is the /spark folder on
the ZooKeeper server. I could not find it on my server.

Thanks for reading,

Paul


2016-01-19 22:12 GMT+01:00 Raghvendra Singh :

> Hi, there is one question. In spark-env.sh should i specify all masters
> for parameter SPARK_MASTER_IP. I've set SPARK_DAEMON_JAVA_OPTS already
> with zookeeper configuration as specified in spark documentation.
>
> Thanks & Regards
> Raghvendra
>
> On Wed, Jan 20, 2016 at 1:46 AM, Raghvendra Singh <
> raghvendra.ii...@gmail.com> wrote:
>
>> Here's the complete master log on reproducing the error
>> http://pastebin.com/2YJpyBiF
>>
>> Regards
>> Raghvendra
>>
>> On Wed, Jan 20, 2016 at 12:38 AM, Raghvendra Singh <
>> raghvendra.ii...@gmail.com> wrote:
>>
>>> Ok, I will try to reproduce the problem. Also, I don't think this is an
>>> uncommon problem: I have been searching for it on Google for many days
>>> and have found lots of questions but no answers.
>>>
>>> Do you know what kinds of settings Spark and ZooKeeper allow for
>>> handling timeouts during leader election, etc., when one is down?
>>>
>>> Regards
>>> Raghvendra
>>> On 20-Jan-2016 12:28 am, "Ted Yu"  wrote:
>>>
 Perhaps I don't have enough information to make further progress.

 On Tue, Jan 19, 2016 at 10:55 AM, Raghvendra Singh <
 raghvendra.ii...@gmail.com> wrote:

> I currently do not have access to those logs, but there were only about
> five lines before this error. They were the same ones that are usually
> present when everything works fine.
>
> Can you still help?
>
> Regards
> Raghvendra
> On 18-Jan-2016 8:50 pm, "Ted Yu"  wrote:
>
>> Can you pastebin master log before the error showed up ?
>>
>> The initial message was posted for Spark 1.2.0
>> Which release of Spark / zookeeper do you use ?
>>
>> Thanks
>>
>> On Mon, Jan 18, 2016 at 6:47 AM, doctorx 
>> wrote:
>>
>>> Hi,
>>> I am facing the same issue, with the given error
>>>
>>> ERROR Master:75 - Leadership has been revoked -- master shutting
>>> down.
>>>
>>> Can anybody help? Any clue will be useful. Should I change something in
>>> the Spark cluster or ZooKeeper? Is there any setting in Spark which can
>>> help me?
>>>
>>> Thanks & Regards
>>> Raghvendra
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>

>>
>


Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
You might not have enough cores to process the data from Kafka:


> When running a Spark Streaming program locally, do not use “local” or
> “local[1]” as the master URL. Either of these means that only one thread
> will be used for running tasks locally. If you are using an input DStream
> based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
> thread will be used to run the receiver, leaving no thread for processing
> the received data. *Hence, when running locally, always use “local[n]” as
> the master URL, where n > number of receivers to run* (see Spark
> Properties for information on how to set the master).


 
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
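
For example, a minimal sketch (hypothetical ZooKeeper quorum, group id and
topic) that leaves one local thread for the receiver and one for processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// local[2]: one thread runs the Kafka receiver, the other processes the batches
val conf = new SparkConf().setAppName("kafka-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical ZK quorum, consumer group and per-topic partition map
val lines = KafkaUtils
  .createStream(ssc, "zk1:2181", "my-group", Map("myTopic" -> 1))
  .map(_._2)

lines.print()
ssc.start()
ssc.awaitTermination()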

2015-12-01 7:13 GMT+01:00 Cassa L <lcas...@gmail.com>:

> Hi,
>  I am reading data from Kafka into Spark. It runs fine for some time but
> then hangs forever with the following output. I don't see any errors in the logs.
> How do I debug this?
>
> 2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO
> (Logging.scala:59) - Adding task set 19.0 with 4 tasks
> 2015-12-01 06:04:30,872 [pool-13-thread-1] INFO  (Logging.scala:59) -
> Disconnected from Cassandra cluster: APG DEV Cluster
> 2015-12-01 06:04:35,060 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949875000 ms
> 2015-12-01 06:04:40,054 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894988 ms
> 2015-12-01 06:04:45,034 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949885000 ms
> 2015-12-01 06:04:50,100 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894989 ms
> 2015-12-01 06:04:55,064 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949895000 ms
> 2015-12-01 06:05:00,125 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894990 ms
>
>
> Thanks
> LCassa
>



-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/