Application logs on YARN

2015-10-08 Thread nibiau
Hello, I submit Spark Streaming jobs to YARN, and I have configured YARN to generate custom logs. It works fine and YARN aggregates the logs into HDFS very well; nevertheless, the log files are only usable via the "yarn logs" command. I would prefer to be able to navigate them via hdfs commands like a

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau
4 partitions. - Original message - From: "Dibyendu Bhattacharya" To: "Nicolas Biau" Cc: "Cody Koeninger", "user" Sent: Sunday 4 October 2015 16:51:38 Subject: Re: Spark Streaming over YARN How

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau
Hello, I am using https://github.com/dibbhatt/kafka-spark-consumer . I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can see one node receiving the Kafka flow. (I use Spark 1.3.1.) Thanks, Nicolas - Original message - From: "Dibyendu Bhattacharya"

Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, In the end, Hive is not a solution, as I cannot update the data. And for archive files I think it would be the same issue. Any other solutions? Nicolas - Original message - From: nib...@free.fr To: "Brett Antonides" Cc: user@spark.apache.org Sent: Friday 2 October

Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, So, is Hive a solution for my need: - I receive small messages (10 KB) identified by an ID (a product ID, for example) - Each message I receive is the latest picture of my product ID, so I basically just want to store the latest picture of each product in HDFS in order to run batch processing on it later.
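
The requirement in this message (keep only the most recent message for each ID, then process the result in batch later) can be illustrated in plain Java, independently of Spark, Hive, or HDFS. This is only a minimal sketch under assumptions: the `LatestById` class, the `latestById` helper, and the `{id, payload}` array layout are hypothetical names chosen for illustration, not anything used in the thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only the most recent payload for each ID.
class LatestById {

    // Each input element is a two-element array: {id, payload}.
    // Messages are assumed to arrive in order, oldest first.
    static Map<String, String> latestById(List<String[]> messages) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] m : messages) {
            latest.put(m[0], m[1]); // a later message replaces the earlier one
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> batch = Arrays.asList(
            new String[] {"p1", "first version"},
            new String[] {"p2", "first version"},
            new String[] {"p1", "second version"});
        // Only one entry per ID remains, holding the last payload received.
        System.out.println(latestById(batch));
    }
}
```

Collapsing a batch this way before writing means one record per ID is stored rather than one file per message; where and how the result is persisted is left open here.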

RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, Thanks. If I understand correctly, Hive can be usable in my context? Nicolas Sent from my Samsung mobile device. Jörn Franke wrote: If you use transactional tables in Hive together with insert, update, delete, then it does the "concatenate" for you

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Thanks a lot. Why did you say "the most recent version"? - Original message - From: "Jörn Franke" <jornfra...@gmail.com> To: "nibiau" <nib...@free.fr> Cc: banto...@gmail.com, user@spark.apache.org Sent: Saturday 3 October 2015 13:56:43 Subject: Re: RE : Re:

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
Hello, Yes, but: - In the Java API I can't find an API to create an HDFS archive - As soon as I receive a message (with a messageID) I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive? Thanks, Nicolas - Original message - From:

Spark Streaming over YARN

2015-10-02 Thread nibiau
Hello, I have a job receiving data from Kafka (4 partitions) and persisting the data in MongoDB. It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores) only one node is receiving all the Kafka partitions and only one node is processing my RDD treatment (foreach function)

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
From my understanding, as soon as I use YARN I don't need to use parallelism (at least for RDD processing). I don't want to use the direct stream, as I have to manage the offset positioning (in order to be able to start from the last offset processed after a Spark job failure) - Original message

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
OK, so if I set, for example, 4 receivers (the number of nodes), how will the RDD be distributed over the nodes/cores? In my example I have 4 nodes (with 2 cores). Thanks, Nicolas - Original message - From: "Dibyendu Bhattacharya" To: nib...@free.fr Cc: "Cody

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
OK, thanks, but can I also update data instead of inserting data? - Original message - From: "Brett Antonides" To: user@spark.apache.org Sent: Friday 2 October 2015 18:18:18 Subject: Re: HDFS small file generation problem I had a very similar problem and solved it

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
Sorry, I just said that I NEED to manage offsets, so in the case of the Kafka direct stream, how can I handle this? Update ZooKeeper manually? Why not, but are there any other solutions? - Original message - From: "Cody Koeninger" To: "Nicolas Biau" Cc: "user"
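
To make "managing offsets" concrete, here is a minimal plain-Java sketch of the bookkeeping involved: record the offset of each message once it has been processed, and on restart resume from the next offset for each partition. It is an illustration under assumptions, not the Kafka or Spark API: `OffsetTracker`, `markProcessed`, and `startOffset` are hypothetical names, and the in-memory map merely stands in for whatever durable store (ZooKeeper, a database, ...) would really hold the offsets.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of manual offset bookkeeping, one entry per partition.
class OffsetTracker {

    // partition -> next offset to read. In-memory stand-in for a durable store.
    private final Map<Integer, Long> nextOffset = new HashMap<>();

    // Where to resume reading for this partition (0 if nothing was processed yet).
    long startOffset(int partition) {
        return nextOffset.getOrDefault(partition, 0L);
    }

    // Call only after the message at `offset` has been fully processed.
    // Long::max keeps the stored position from moving backwards.
    void markProcessed(int partition, long offset) {
        nextOffset.merge(partition, offset + 1, Long::max);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        tracker.markProcessed(0, 41L);
        System.out.println("partition 0 resumes at " + tracker.startOffset(0));
    }
}
```

Recording the offset after processing rather than before means a failure can cause a message to be processed again, but not skipped.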

HDFS small file generation problem

2015-09-27 Thread nibiau
Hello, I'm still investigating the small file problem generated by my Spark Streaming jobs. Indeed, my Spark Streaming jobs receive a lot of small events (avg 10 KB), and I have to store them in HDFS in order to process them with Pig jobs on demand. The problem is the fact that I

Receiver and Parallelization

2015-09-25 Thread nibiau
Hello, I use a custom receiver in order to receive JMS messages from MQ servers. I want to benefit from the YARN cluster; my questions are: - Is it possible to have only one node receiving JMS messages and parallelize the RDD over all the cluster nodes? - Is it possible to also parallelize the

Spark Streaming distributed job

2015-09-21 Thread nibiau
Hello, Could you please explain to me what exactly is distributed when I launch a Spark Streaming job on a YARN cluster? My code is something like: JavaDStream customReceiverStream = ssc.receiverStream(streamConfig.getJmsReceiver()); JavaDStream incoming_msg = customReceiverStream.map(

Distribute JMS receiver jobs on YARN

2015-09-17 Thread nibiau
Hello, I have a Spark application with a JMS receiver. Basically my application does: JavaDStream incoming_msg = customReceiverStream.map( new Function() { public String

Re: Small File to HDFS

2015-09-03 Thread nibiau
My main question in the case of HAR usage is: is it possible to use Pig on it, and what about performance? - Original message - From: "Jörn Franke" To: nib...@free.fr, user@spark.apache.org Sent: Thursday 3 September 2015 15:54:42 Subject: Re: Small File to HDFS Store

Re: Small File to HDFS

2015-09-03 Thread nibiau
A HAR archive seems a good idea, but just one last question to be sure I make the best choice: - Is it possible to overwrite (remove/replace) a file inside the HAR? Basically the names of my small files will be the keys of my records, and sometimes I will need to replace the content of a file with a

Re: Small File to HDFS

2015-09-03 Thread nibiau
OK, but then some questions: - Sometimes I have to remove some messages from HDFS (cancel/replace cases); is it possible? - In the case of a big zip file, is it possible to easily run Pig on it directly? Thanks, Nicolas - Original message - From: "Tao Lu" To:

Small File to HDFS

2015-09-02 Thread nibiau
Hello, I'm currently using Spark Streaming to collect small messages (events); their size is <50 KB, the volume is high (several million per day), and I have to store those messages in HDFS. I understand that storing small files can be problematic in HDFS; how can I manage it? Thanks, Nicolas
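
One general way to avoid producing one file per message is to group many small messages into larger chunks before writing. The plain-Java sketch below illustrates that idea only; it is not a solution proposed in the thread. `MessageBatcher`, its methods, and the size threshold are hypothetical, and the in-memory list of chunks stands in for the actual write to storage.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pack small messages into larger newline-delimited chunks.
class MessageBatcher {
    private final int maxBytes;                       // target upper size of one chunk
    private final StringBuilder current = new StringBuilder();
    private int currentBytes = 0;
    private final List<String> flushed = new ArrayList<>();

    MessageBatcher(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Appends one message; closes the current chunk first if it would overflow.
    // A single message larger than maxBytes still gets a chunk of its own.
    void add(String message) {
        String line = message + "\n";
        int size = line.getBytes(StandardCharsets.UTF_8).length;
        if (currentBytes > 0 && currentBytes + size > maxBytes) {
            flush();
        }
        current.append(line);
        currentBytes += size;
    }

    // Closes the current chunk; a real job would write it out as one file here.
    void flush() {
        if (currentBytes == 0) {
            return;
        }
        flushed.add(current.toString());
        current.setLength(0);
        currentBytes = 0;
    }

    List<String> flushedChunks() {
        return flushed;
    }

    public static void main(String[] args) {
        MessageBatcher batcher = new MessageBatcher(64 * 1024);
        batcher.add("{\"id\":\"p1\"}");
        batcher.add("{\"id\":\"p2\"}");
        batcher.flush();
        System.out.println(batcher.flushedChunks().size() + " chunk(s) ready to write");
    }
}
```

The trade-off is that individual messages are no longer addressable as separate files, which matters if single messages must later be removed or replaced.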

Re: Small File to HDFS

2015-09-02 Thread nibiau
Hi, I already store them in MongoDB in parallel for operational access and don't want to add another database to the loop. Is it the only solution? Thanks, Nicolas - Original message - From: "Ted Yu" To: nib...@free.fr Cc: "user" Sent: Wednesday 2

Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread nibiau
Hello, I am a new user of Spark, and need to know what the best practice would be for the following scenario: - Spark Streaming receives XML messages from Kafka - Spark transforms each message of the RDD (xml2json + some enrichments) - Spark stores the transformed/enriched messages inside

Best practice to update a MongoDB document from Spark

2015-05-28 Thread nibiau
Hello, I'm evaluating Spark/Spark Streaming. I use Spark Streaming to receive messages from a Kafka topic. As soon as I have a JavaReceiverInputDStream, I have to process each message; for each one I have to search in MongoDB to find whether a document exists. If I find the document I have to