Twitter streaming with Apache Spark streams only a small number of tweets

2015-07-25 Thread Zoran Jeremic
Hi, I've implemented Twitter streaming as in the code given at the bottom of this email. It finds some tweets based on the hashtags I'm following. However, it seems that a large number of tweets are missing. I've tried to post some tweets that I'm following in the application, and none of them was

Re: Parquet writing gets progressively slower

2015-07-25 Thread Michael Kelly
Thanks for the suggestion Cheng, I will try that today. Are there any implications when reading the parquet data if there are no summary files present? Michael On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian lian.cs@gmail.com wrote: The time is probably spent by ParquetOutputFormat.commitJob.

Re: Writing binary files in Spark

2015-07-25 Thread Akhil Das
It's been added since Spark 1.1.0, I guess: https://issues.apache.org/jira/browse/SPARK-1161 Thanks, Best Regards. On Sat, Jul 25, 2015 at 12:06 AM, Oren Shpigel o...@yowza3d.com wrote: Sorry, I didn't mention I'm using the Python API, which doesn't have the saveAsObjectFiles method. Is there any

Re: 1.4.0 classpath issue with spark-submit

2015-07-25 Thread Michal Haris
The thing is that the class it is complaining about is part of the spark assembly jar, not in my extra jar. The assembly jar was compiled with -Phive which is proven by the fact that it works with the same SPARK_HOME when run as shell. On 23 July 2015 at 17:33, Akhil Das

Re: Insert data into a table

2015-07-25 Thread sim
I don't think INSERT INTO is supported. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Insert-data-into-a-table-tp21898p23990.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread nibiau
Hello, I am a new user of Spark, and need to know what the best practice would be for the following scenario: - Spark Streaming receives XML messages from Kafka - Spark transforms each message of the RDD (xml2json + some enrichments) - Spark stores the transformed/enriched messages inside

spark-dataflow + Spark Streaming + Kafka

2015-07-25 Thread Albert Strasheim
Hello all New Spark user here. We've been looking at the Spark ecosystem to build some new parts of our log processing pipeline. The spark-dataflow project looks especially interesting. The windowing and triggers concepts look like a good fit for what we need to do: our log data going into

Re: Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread Cody Koeninger
Use foreachPartition and batch the writes. On Sat, Jul 25, 2015 at 9:14 AM, nib...@free.fr wrote: Hello, I am a new user of Spark, and need to know what the best practice would be for the following scenario: - Spark Streaming receives XML messages from Kafka - Spark transforms each message
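The advice above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not the poster's code: `batched`, `make_partition_writer`, `connect`, `write_batch`, and the batch size of 500 are all hypothetical names and values, and the sink object stands in for whatever Mongo or HDFS client is actually used. The idea is that each partition opens one connection and sends records in groups rather than one at a time.

```python
from itertools import islice


def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_partition_writer(connect, size=500):
    """Build a function that could be passed to rdd.foreachPartition.

    `connect` is a hypothetical zero-argument factory returning an object
    with write_batch(list) and close(); it is called inside the partition
    function so the connection is created on the worker, not the driver.
    """
    def write_partition(records):
        sink = connect()
        try:
            for chunk in batched(records, size):
                sink.write_batch(chunk)  # one round trip per chunk
        finally:
            sink.close()
    return write_partition


# Intended use inside a streaming job (not executed here):
#   dstream.foreachRDD(
#       lambda rdd: rdd.foreachPartition(make_partition_writer(connect)))
```

Creating the connection inside the partition function matters because client objects are usually not serializable and cannot be shipped from the driver.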

Parallelism of Custom receiver in spark

2015-07-25 Thread anshu shukla
1 - How can I increase the level of *parallelism in a Spark Streaming custom RECEIVER*? 2 - Will ssc.receiverStream(/**anything //) *delete the data stored in Spark memory using the store(s)* logic? -- Thanks & Regards, Anshu Shukla

[Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-07-25 Thread Roberto Coluccio
Hello Spark community, I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode on an EMR cluster (AMI 3.7.0), that reads input data through a HiveContext, in particular SELECTing data from an EXTERNAL TABLE backed on S3. This table has dynamic partitions and contains *hundreds

Multiple operations on same DStream in Spark Streaming

2015-07-25 Thread foobar
Hi, I'm working with Spark Streaming using Scala, and trying to figure out the following problem. In my DStream[(int, int)], each record is an int pair tuple. For each batch, I would like to filter out all records whose first integer is below the average of the first integers in this batch, and for all records
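The first half of this question (dropping records whose first integer is below the batch average) can be illustrated with a small Python sketch. It assumes that "filter out" means removing the below-average records and keeping the rest; the function name `drop_below_average` is hypothetical, and the Spark wiring in the trailing comment is an untested outline rather than a confirmed solution from the thread.

```python
def drop_below_average(batch):
    """Keep pairs whose first element is >= the mean of all first elements."""
    batch = list(batch)
    if not batch:
        return []
    avg = sum(first for first, _ in batch) / len(batch)
    return [(first, second) for first, second in batch if first >= avg]


# Untested outline of the same logic per batch in PySpark:
#   def per_batch(rdd):
#       if rdd.isEmpty():
#           return rdd
#       avg = rdd.keys().mean()           # first pass: batch average
#       return rdd.filter(lambda kv: kv[0] >= avg)  # second pass: filter
#   filtered = dstream.transform(per_batch)
```

The average has to be known before any record can be judged, so each batch needs two passes: one to compute the mean and one to filter.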

Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-25 Thread Peter Leventis
I just wanted an easy step-by-step guide as to exactly what version of whatever to download for a Proof of Concept installation of Apache Spark on Windows 7. I have spent quite some time following a number of different recipes to no avail. I have tried about 10 different permutations to date. I

Question abt serialization

2015-07-25 Thread tog
Hi, I have been using Spark for quite some time with either Scala or Python. I wanted to give Groovy a try through scripts for small tests. Unfortunately I get the following exception (using this simple script: https://gist.github.com/galleon/d6540327c418aa8a479f). Is there anything I am not

ReceiverStream in Spark not able to cope with 20,000 events/sec

2015-07-25 Thread anshu shukla
My eventGen is emitting 20,000 events/sec, and I am using store(s1) in the receive() method to push data to the receiverStream. This logic works fine for up to 4000 events/sec, but no batches are seen being emitted at higher rates. *CODE:TOPOLOGY -* *JavaDStream<String> sourcestream =