Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread ayan guha
Probably a naive question: can you try the same in the Hive CLI and see if your SQL is working? Looks like a Hive thing to me, as Spark is faithfully delegating the query to Hive. On 29 May 2015 03:22, Abhishek Tripathi trackissue...@gmail.com wrote: Hi, I'm using the CDH5.4.0 quick start VM and tried

Spark MLlib variance analysis

2015-05-28 Thread rafac
I have a simple problem: I have the mean number of people at one place by hour (time-series like), and now I want to know whether the weather conditions have an impact on that mean. I would do it with variance analysis like ANOVA in SPSS, or by analysing the resulting regression model summary. How is it

Re: Batch aggregation by sliding window + join

2015-05-28 Thread ayan guha
Which version of Spark? In 1.4, window queries will show up for these kinds of scenarios. One thing I can suggest is to keep daily aggregates materialised, partitioned by key and sorted by the key-day combination using the repartitionAndSortWithinPartitions method. It allows you to use a custom partitioner and a custom sorter.
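A minimal sketch of what that could look like (a hypothetical illustration, not from the thread; input layout, paths and the partition count are made up, Spark 1.3+ assumed so no extra imports are needed for the pair-RDD functions):

    import org.apache.spark.Partitioner

    // Partition by key only, but let the (key, day) tuple ordering sort rows
    // within each partition, so consecutive days of a key sit next to each other.
    class KeyOnlyPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[(String, String)]._1
        (k.hashCode % numPartitions + numPartitions) % numPartitions
      }
    }

    // Hypothetical daily aggregates: lines of "key,day,count" (layout made up).
    val daily = sc.textFile("hdfs:///aggregates/daily")
      .map(_.split(','))
      .map(a => ((a(0), a(1)), a(2).toLong))          // ((key, day), count)

    val materialised = daily.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(24))
    materialised.saveAsTextFile("hdfs:///aggregates/daily-sorted")   // keep for the 3/7/30-day blocks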

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Which would imply that if there was a load manager type of service, it could signal to the driver(s) that they need to acquiesce, i.e. process what's at hand and terminate. Then bring up a new machine, then restart the driver(s)... Same deal with removing machines from the cluster. Send a signal

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
Thanks Sandy- I was digging through the code in the deploy.yarn.Client and literally found that property right before I saw your reply. I'm on 1.2.x right now which doesn't have the property. I guess I need to update sooner rather than later. On Thu, May 28, 2015 at 3:56 PM, Sandy Ryza

UDF accessing hive struct array fails with buffer underflow from kryo

2015-05-28 Thread yluo
Hi all, I'm using Spark 1.3.1 with Hive 0.13.1. When running a UDF that accesses a Hive struct array, the query fails with: Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow. Serialization trace: fieldName

spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-28 Thread roy
Hi, suddenly Spark jobs started failing with the following error: Exception in thread main java.io.FileNotFoundException: /user/spark/applicationHistory/application_1432824195832_1275.inprogress (No such file or directory). Full trace here: [21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --class

Adding an indexed column

2015-05-28 Thread Cesar Flores
Assuming that I have the following data frame:
flag | price
1 | 47.808764653746
1 | 47.808764653746
1 | 31.9869279512204
1 | 47.7907893713564
1 | 16.7599200038239
1 | 16.7599200038239
1 | 20.3916014172137
How can I create a data frame with an extra indexed column
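One common way to do this (a sketch, not from the thread; assumes a Spark 1.3-style API, that the DataFrame is called df, and that the current row order is the desired index order):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Pair every row with a stable 0-based index, then append it as a new column.
    val withIdx = df.rdd
      .zipWithIndex()
      .map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }

    val schema = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
    val indexedDf = sqlContext.createDataFrame(withIdx, schema)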

Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Mulugeta Mammo
Thanks for the valuable information. The blog states: The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time. So, I guess the max number of executor-cores I can assign is the

Fwd: [Streaming] Configure executor logging on Mesos

2015-05-28 Thread Tim Chen
13:36:22.958067 26890 exec.cpp:206] Executor registered on slave 20150528-063307-780930314-5050-8152-S5 Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties So, no matter what I provide

Hyperthreading

2015-05-28 Thread Mulugeta Mammo
Hi guys, Does the SPARK_EXECUTOR_CORES assume Hyper threading? For example, if I have 4 cores with 2 threads per core, should the SPARK_EXECUTOR_CORES be 4*2 = 8 or just 4? Thanks,

RE: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Evo Eftimov
I don’t think the number of CPU cores controls the “number of parallel tasks”. The number of tasks corresponds first and foremost to the number of (DStream) RDD partitions. The Spark documentation doesn’t mention what is meant by “task” in terms of standard multithreading terminology, i.e. a

yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
I am submitting jobs to my YARN cluster via yarn-cluster mode, and I'm noticing that the JVM that fires up to allocate the resources, etc... is not going away after the application master and executors have been allocated. Instead, it just sits there printing 1-second status updates to the console.

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Sandy Ryza
Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet cjno...@gmail.com wrote: I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm

Re: PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Davies Liu
Could you try to comment out some lines in `extract_sift_features_opencv` to find which line causes the crash? If the bytes coming from sequenceFile() are broken, it's easy to crash a C library in Python (OpenCV). On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga sammiest...@gmail.com wrote: Hi

Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Ruslan Dautkhanov
It's not only about cores. Keep in mind spark.executor.cores also affects the available memory for each task: From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction
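If I recall that blog post's formula correctly, the full expression is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores. A quick worked example with made-up settings:

    // Illustrative numbers only -- not from the thread.
    val executorMemoryGb   = 8.0    // spark.executor.memory = 8g
    val shuffleMemFraction = 0.2    // spark.shuffle.memoryFraction (1.x default)
    val shuffleSafety      = 0.8    // spark.shuffle.safetyFraction (1.x default)
    val executorCores      = 5      // spark.executor.cores

    val perTaskShuffleGb = executorMemoryGb * shuffleMemFraction * shuffleSafety / executorCores
    // = 8 * 0.2 * 0.8 / 5 = 0.256 GB of shuffle memory per concurrently running task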

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
Hi all, As the author of the dynamic allocation feature I can offer a few insights here. Gerard's explanation was both correct and concise: dynamic allocation is not intended to be used in Spark streaming at the moment (1.4 or before). This is because of two things: (1) Number of receivers is

Batch aggregation by sliding window + join

2015-05-28 Thread igor.berman
Hi, I have a daily batch job that computes a daily aggregate of several counters represented by some object. After the daily aggregation is done, I want to compute a block of 3-day aggregations (3, 7, 30, etc.). To do so I need to add the new daily aggregation to the current block and then subtract from the current

Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread Abhishek Tripathi
Hi, I'm using the CDH5.4.0 quick start VM and tried to build Spark with Hive compatibility so that I can run Spark SQL and access temp tables remotely. I used the command below to build Spark; the build was successful, but when I try to access Hive data from Spark SQL, I get an error. Thanks, Abhi

Twitter Streaming HTTP 401 Error

2015-05-28 Thread tracynj
I am working on the Databricks Reference applications, porting them to my company's platform, and extending them to emit RDF. I have already gotten them working with the extension on EC2, and have the Log Analyzer application working on our platform. But the Twitter Language Classifier application

Re: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Nan Xiao
Hi Somnath, are there step-by-step instructions for using Eclipse to develop Spark applications? I think many people need them. Thanks! Best Regards, Nan Xiao On Thu, May 28, 2015 at 3:15 PM, Somnath Pandeya somnath_pand...@infosys.com wrote: Try the Scala Eclipse plugin to eclipsify the Spark project

Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Mohit Jaggi
I have used VoltDB and Spark. The use cases for the two are quite different. VoltDB is intended for transactions and also supports queries on the same (custom to VoltDB) store. Spark (SQL) is NOT suitable for transactions; it is designed for querying immutable data (which may exist in several

Registering Custom metrics [Spark-Streaming-monitoring]

2015-05-28 Thread Snehal Nagmote
Hello all, I am using Spark Streaming 1.3. I want to capture a few custom metrics based on accumulators; I followed an approach somewhat similar to this: val instrumentation = new SparkInstrumentation(example.metrics) val numReqs = sc.accumulator(0L)

Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hi Mohit, Thanks for your reply. If my use case is purely querying read-only data (no transaction scenarios), at what scale is one of them a better option than the other? I am aware that for scale which can be supported on a single node, VoltDB is a better choice. However, when the scale grows

Spark Cassandra

2015-05-28 Thread lucas
Hello, I am trying to save data from Spark to Cassandra. I have a ScalaEsRDD (because I take the data from Elasticsearch) that contains a lot of key/values like this: (AU16r4o_kbhIuSky3zFO, Map(@timestamp -> 2015-05-21T21:35:54.035Z, timestamp -> 2015-05-21 23:35:54,035, loglevel -> INFO,
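A rough sketch of one way to write such (id, map) pairs with the spark-cassandra-connector (a hypothetical example: the keyspace, table and column names are made up, the connector jar must be on the classpath, and spark.cassandra.connection.host must be set in the SparkConf):

    import com.datastax.spark.connector._

    // esRdd: RDD[(String, Map[String, String])] as described above (name assumed).
    // Made-up target table: CREATE TABLE logs.events (id text PRIMARY KEY, fields map<text, text>);
    esRdd.saveToCassandra("logs", "events", SomeColumns("id", "fields"))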

How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Nan Xiao
Hi all, I want to use Eclipse on Windows to build a Spark environment, but I find that the reference page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup) doesn't contain any guide for Eclipse. Could anyone give tutorials or links about how to use

RE: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Somnath Pandeya
Try the Scala Eclipse plugin to eclipsify the Spark project and import Spark as an Eclipse project. -Somnath -Original Message- From: Nan Xiao [mailto:xiaonan830...@gmail.com] Sent: Thursday, May 28, 2015 12:32 PM To: user@spark.apache.org Subject: How to use Eclipse on Windows to build Spark
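A sketch of the usual sbt-based route (an assumption on my part, not confirmed in the thread; the plugin version is illustrative):

    // project/plugins.sbt -- add the sbteclipse plugin (version is illustrative)
    addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")

    // Then run "sbt eclipse" from the project root to generate Eclipse project files,
    // and use File > Import > Existing Projects into Workspace in Eclipse.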

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
I do it this way: - Launch a new instance by clicking on the slave instance and choosing launch more like this - Once it's launched, ssh into it and add the master's public key to .ssh/authorized_keys - Add the slave's internal IP to the master's conf/slaves file - Run sbin/start-all.sh and it will

Adding slaves on spark standalone on ec2

2015-05-28 Thread nizang
Hi, I'm working on a Spark standalone system on EC2, and I'm having problems resizing the cluster (meaning adding or removing slaves). In the basic EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html), there's only a script for launching the cluster, not for adding slaves to it. On the

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
Hi Zhang, Could you paste your code in a gist? Not sure what you are doing inside the code to fill up memory. Thanks Best Regards On Thu, May 28, 2015 at 10:08 AM, Ji ZHANG zhangj...@gmail.com wrote: Hi, Yes, I'm using createStream, but the storageLevel param is by default

Re: Recommended Scala version

2015-05-28 Thread Tathagata Das
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out soon) with Scala 2.11 and report issues. TD On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers ko...@tresata.com wrote: we are still running into issues with spark-shell not working on 2.11, but we are running on somewhat

DataFrame nested structure selection limit

2015-05-28 Thread Eugene Morozov
Hi! I have a JSON file with some data; I'm able to create a DataFrame out of it, and the schema for the particular part of it I'm interested in looks like the following: val json: DataFrame = sqlc.load("entities_with_address2.json", "json") root |-- attributes: struct (nullable = true) | |-- Address2:
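Nested struct fields can usually be reached with dot-separated paths in select; a small sketch (the leaf field names below Address2 are invented for illustration):

    // Select the whole nested struct, or individual leaves via dot notation.
    val addresses = json.select("attributes.Address2")
    val streets   = json.select("attributes.Address2.street")   // "street" is a hypothetical leaf field
    streets.show()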

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
Hi, I wrote a simple test job; it only does very basic operations. For example: val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1)).map(_._2) val logs = lines.flatMap { line => try { Some(parse(line).extract[Impression]) } catch { case _:

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Nizan Grauer
Hi, thanks for your answer! I have a few more questions: 1) The file /root/spark/conf/slaves has the full DNS names of the servers (e.g. ec2-52-26-7-137.us-west-2.compute.amazonaws.com); did you add the internal IP there? 2) You call start-all. Isn't that too aggressive? Let's say I have 20 slaves up, and I

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
1. Up to you; you can add either the internal IP or the external IP, it won't be a problem as long as they are in the same network. 2. If you only want to start a particular slave, then you can do something like: sbin/start-slave.sh worker# master-spark-URL Thanks Best Regards On Thu, May 28, 2015 at 1:52

Get all servers in security group in bash(ec2)

2015-05-28 Thread nizang
Hi, is there any way in bash (from an EC2 Spark server) to list all the servers in my security group (or better, in a given security group)? I tried using: wget -q -O - http://instance-data/latest/meta-data/security-groups which returns security_group_xxx, but now I want all the servers in the security group

Re: Get all servers in security group in bash(ec2)

2015-05-28 Thread Akhil Das
You can use the Python boto library for that; in fact the spark-ec2 script uses it underneath. Here's the call spark-ec2 makes to get all machines under a given security group: https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706 Thanks Best Regards On Thu, May 28, 2015 at 2:22 PM,

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Gerard Maas
Hi, tl;dr At the moment (with a BIG disclaimer *) elastic scaling of Spark Streaming processes is not supported. *Longer version.* I assume that you are talking about Spark Streaming, as the discussion is about handling Kafka streaming data. Then you have two things to consider: the Streaming

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
Can you replace your counting part with this? logs.filter(_.s_id > 0).foreachRDD(rdd => logger.info(rdd.count())) Thanks Best Regards On Thu, May 28, 2015 at 1:02 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I wrote a simple test job, it only does very basic operations. for example:

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thank you, Gerard. We're looking at the receiver-less setup with Kafka Spark streaming so I'm not sure how to apply your comments to that case (not that we have to use receiver-less but it seems to offer some advantages over the receiver-based). As far as the number of Kafka receivers is fixed

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
@DG: the key metrics should be: - Scheduling delay – its ideal state is to remain constant over time and ideally be less than the time of the micro-batch window - The average job processing time should remain less than the micro-batch window - Number of lost jobs

Re: why does com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException happen?

2015-05-28 Thread randylu
begs for your help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-does-com-esotericsoftware-kryo-KryoException-java-u-til-ConcurrentModificationException-happen-tp23067p23068.html Sent from the Apache Spark User List mailing list archive at

why does com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException happen?

2015-05-28 Thread randylu
My program runs for 500 iterations, but it almost always fails at about 150 iterations. It's hard to explain the details of my program, but I think my program is OK, for it runs successfully sometimes. *I just want to know in which situations this exception will happen*. The detailed error information is

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by

SPARK STREAMING PROBLEM

2015-05-28 Thread Animesh Baranawal
Hi, I am trying to extract the filenames from which a DStream is generated by parsing the output of the toDebugString method on the RDD. I am running the following code in spark-shell: import org.apache.spark.streaming.{StreamingContext, Seconds} val ssc = new StreamingContext(sc, Seconds(10)) val lines =

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
You can always spin up new boxes in the background and bring them into the cluster fold when fully operational, and time that with the job relaunch and parameter change. Kafka offsets are managed automatically for you by the Kafka clients, which keep them in ZooKeeper; don't worry about that as long as you

Re: SPARK STREAMING PROBLEM

2015-05-28 Thread Sourav Chandra
You must start the StreamingContext by calling ssc.start() On Thu, May 28, 2015 at 6:57 PM, Animesh Baranawal animeshbarana...@gmail.com wrote: Hi, I am trying to extract the filenames from which a Dstream is generated by parsing the toDebugString method on RDD I am implementing the

Dataframe Partitioning

2015-05-28 Thread Masf
Hi. I have 2 dataframes with 1 and 12 partitions respectively. When I do an inner join between these dataframes, the result contains 200 partitions. *Why?* df1.join(df2, df1("id") === df2("id"), "inner") returns 200 partitions Thanks!!! -- Regards. Miguel Ángel

Fwd: SPARK STREAMING PROBLEM

2015-05-28 Thread Animesh Baranawal
I also started the streaming context by running ssc.start(), but apart from logs nothing of g gets printed. -- Forwarded message -- From: Animesh Baranawal animeshbarana...@gmail.com Date: Thu, May 28, 2015 at 6:57 PM Subject: SPARK STREAMING PROBLEM To: user@spark.apache.org

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Andrew Otto
val sqlContext = new HiveContext(sc) val schemaRdd = sqlContext.sql("some complex SQL") It mostly works, but I have been having issues with tables that contain a large amount of data: https://issues.apache.org/jira/browse/SPARK-6910 On May

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with a unique transformation have
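One way to approximate this at the RDD level (a sketch only; the RDD name and element type are assumed, and it deduplicates only within each partition, so duplicates spanning partitions remain):

    // rdd: RDD of tuples (name and type assumed for illustration).
    // toSet buffers a whole partition in memory; fine for moderate partition sizes.
    val locallyDistinct = rdd.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)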

Re: SPARK STREAMING PROBLEM

2015-05-28 Thread Sourav Chandra
The problem lies in the way you are doing the processing. After the g.foreach(x => {println(x); println()}) you are doing ssc.start. It means that until now what you did was just set up the computation steps, but Spark has not started any real processing. So when you do g.foreach, what it iterates
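A minimal sketch of the intended ordering (the input directory is made up; output is triggered through a DStream output operation rather than a driver-side foreach before start):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("hdfs:///incoming")        // made-up directory

    // Register an output operation on the DStream; it runs on every batch.
    lines.foreachRDD { rdd => println(rdd.toDebugString) }

    ssc.start()               // only now does Spark begin processing batches
    ssc.awaitTermination()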

[Streaming] Configure executor logging on Mesos

2015-05-28 Thread Gerard Maas
in the spark assembly: I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave 20150528-063307-780930314-5050-8152-S5 Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties So

Re: Dataframe Partitioning

2015-05-28 Thread Silvio Fiorito
That’s due to the config setting spark.sql.shuffle.partitions, which defaults to 200. From: Masf Date: Thursday, May 28, 2015 at 10:02 AM To: user@spark.apache.org Subject: Dataframe Partitioning Hi. I have 2 dataframes with 1 and 12 partitions respectively. When I do
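A small sketch of changing it before the join (the target value of 12 is just an example):

    // Lower the number of partitions produced by shuffles (joins, aggregations) in Spark SQL.
    sqlContext.setConf("spark.sql.shuffle.partitions", "12")
    val joined = df1.join(df2, df1("id") === df2("id"), "inner")   // now yields 12 partitions instead of 200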

Spark streaming with kafka

2015-05-28 Thread boci
Hi guys, I'm using Spark Streaming with Kafka... On my local machine (started as a Java application without using spark-submit) it works, connects to Kafka and does the job (*). I tried to put it into a Spark Docker container (Hadoop 2.6, Spark 1.3.1, tried spark-submit with local[5] and yarn-client too), but I'm

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Evo, good points. On the dynamic resource allocation, I'm surmising this only works within a particular cluster setup. So it improves the usage of current cluster resources but it doesn't make the cluster itself elastic. At least, that's my understanding. Memory + disk would be good and

Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hello, I was wondering if there is any documented comparison of SparkSQL with MemSQL/VoltDB kind of in-memory SQL databases. MemSQL etc. too allow queries to be run in a clustered environment. What is the major differentiation? Regards, Ashish

Best practice to update a MongoDB document from Sparks

2015-05-28 Thread nibiau
Hello, I'm evaluating Spark/Spark Streaming. I use Spark Streaming to receive messages from a Kafka topic. As soon as I have a JavaReceiverInputDStream, I have to process each message; for each one I have to search in MongoDB to find whether a document exists. If I find the document I have to
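A common pattern for this kind of lookup-then-update is foreachRDD + foreachPartition with one client per partition; a sketch only (using the legacy Mongo Java driver as an upsert, which may or may not match the exact update logic needed; the stream, host, database, collection and field names are all made up):

    import com.mongodb.{BasicDBObject, MongoClient}

    // messages: DStream[String] from Kafka (name assumed).
    messages.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Create the client on the executor rather than serializing it from the driver.
        val client = new MongoClient("mongo-host", 27017)          // made-up host
        val coll = client.getDB("mydb").getCollection("docs")      // made-up names
        partition.foreach { msg =>
          val query  = new BasicDBObject("messageId", msg)         // made-up key field
          val update = new BasicDBObject("$set", new BasicDBObject("lastSeen", System.currentTimeMillis()))
          coll.update(query, update, true, false)                  // upsert = true, multi = false
        }
        client.close()
      }
    }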

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
Probably you should ALWAYS keep the RDD storage policy at MEMORY_AND_DISK – it will be your insurance policy against system crashes due to memory leaks. As long as there is free RAM, Spark Streaming (Spark) will NOT resort to disk – and of course resorting to disk from time to time (i.e. when there is no
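A sketch of what that looks like in code (the DStream name is made up; the serialized variant trades a little CPU for a smaller memory footprint):

    import org.apache.spark.storage.StorageLevel

    // Keep stream RDDs in memory and spill to disk only when RAM runs out,
    // instead of dropping blocks and risking recomputation or OOM.
    stream.persist(StorageLevel.MEMORY_AND_DISK_SER)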

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
Hi, unfortunately they're still growing, both driver and executors. If I run the same job in local mode, everything is fine. On Thu, May 28, 2015 at 5:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you replace your counting part with this? logs.filter(_.s_id > 0).foreachRDD(rdd =>

Re: debug jsonRDD problem?

2015-05-28 Thread Michael Stone
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote: Looks like the exception was caused by resolved.get(prefix ++ a) returning None: a = StructField(a.head, resolved.get(prefix ++ a).get, nullable = true) There are three occurrences of resolved.get() in createSchema() - None should

PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Sam Stoelinga
Hi sparkers, I am working on a PySpark application which uses the OpenCV library. It runs fine when running the code locally, but when I try to run it on Spark on the same machine it crashes the worker. The code can be found here: https://gist.github.com/samos123/885f9fe87c8fa5abf78f This is the

Loading CSV to DataFrame and saving it into Parquet for speedup

2015-05-28 Thread M Rez
I am using spark-csv to load around 10,000 CSV files (about 50 GB) into a couple of unified DataFrames. Since this process is slow I have written this snippet: targetList.foreach { target => // this is using sqlContext.load by getting the list of files then loading them according to schema files
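A rough sketch of the load-then-persist-as-Parquet idea for Spark 1.3 (a hypothetical example: the paths, schema fields and options are made up, and the spark-csv package is assumed to be on the classpath):

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // A made-up schema standing in for the per-target schema files mentioned above.
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("value", StringType, nullable = true)))

    val df = sqlContext.load(
      "com.databricks.spark.csv",
      schema,
      Map("path" -> "hdfs:///data/csv/target1/*.csv", "header" -> "true"))   // made-up location

    // Persist once as Parquet so later jobs read columnar files instead of re-parsing CSV.
    df.saveAsParquetFile("hdfs:///data/parquet/target1")

    // Subsequent runs then load the Parquet directly:
    val fast = sqlContext.parquetFile("hdfs:///data/parquet/target1")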