Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes, if my Hortonworks YARN-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources, and it runs very slowly. Is there a way to force Spark/YARN to begin only when it has the full set of
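
A sketch of two scheduler settings that can help here, assuming Spark 1.2+ (the values shown are illustrative):

    val conf = new org.apache.spark.SparkConf()
      // wait until 100% of the requested executors have registered before scheduling tasks
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      // ...but give up waiting after 2 minutes (value in milliseconds)
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120000")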

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
Thanks, will try this out and get back... On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das t...@databricks.com wrote: Try adding the provided scopes to the dependency: <!-- Spark dependency --> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId>

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Benjamin Fradet
Are you using checkpointing? I had a similar issue when recreating a streaming context from checkpoint as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, Nipun Arora nipunarora2...@gmail.com wrote: Hi, I have a spark streaming application where I need to access a model saved

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun

Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
It all depends on what it is you need to do with the pages. If you’re just going to be collecting them then it’s really not much different than a groupByKey. If instead you’re looking to derive some other value from the series of pages then you could potentially partition by user id and run a
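
A minimal sketch of the alternative being described: deriving a per-user value with aggregateByKey instead of collecting whole groups (data and names are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("agg").setMaster("local[2]"))
    // (userId, pageUrl) visit pairs -- hypothetical data
    val visits = sc.parallelize(Seq(("u1", "a"), ("u1", "b"), ("u2", "a")))
    // per-user page count: no full page lists are shuffled to a single task
    val counts = visits.aggregateByKey(0)((n, _) => n + 1, _ + _)
    counts.collect().foreach(println) // (u1,2), (u2,1)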

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I don't think I have explicitly check-pointed anywhere. Unless it's internal in some interface, I don't believe the application is checkpointed. Thanks for the suggestion though.. Nipun On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet benjamin.fra...@gmail.com wrote: Are you using

[Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I have *no problems in running the same code with broadcast variables in the local installation.* However I get a *null pointer* *exception* when I deploy it on my spark test cluster. I have stored a

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies Liu dav...@databricks.com wrote: Right now, we can not figure out which column you referenced in `select`, if there are multiple row with the same name in the joined DataFrame

java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Juan Rodríguez Hortalá
Hi, I'm running a program in Spark 1.4 where several Spark Streaming contexts are created from the same Spark context. As pointed out in https://spark.apache.org/docs/latest/streaming-programming-guide.html, each Spark Streaming context is stopped before the next one is created. The
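
A minimal sketch of the pattern being described, assuming ssc1 and the shared SparkContext sc already exist (Spark 1.4 API):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // stop the first streaming context but keep the shared SparkContext alive
    ssc1.stop(stopSparkContext = false, stopGracefully = true)
    // then create the next streaming context from the same SparkContext
    val ssc2 = new StreamingContext(sc, Seconds(5))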

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
Thanks. Yes, unfortunately, they all need to be grouped. I guess I can partition the records by user id. However, I have millions of users; do you think partitioning by user id will help? Jianguo On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: You’re right

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread scar scar
Thank you Tathagata, It is great to know about this issue, but our problem is a little bit different. We have 3 nodes in our Spark cluster, and when the Zookeeper leader dies, the Master Spark gets shut down, and remains down, but a new master gets elected and loads the UI. I think if the problem

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
That issue happens only in python dsl? On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote: Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies Liu dav...@databricks.com wrote: Right now, we can not figure out which

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error so just posting on the list. It seems broadcast variables cannot be declared static; if you do, you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora nipunarora2...@gmail.com wrote: btw. just for reference I have added the code in a

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
I'm wondering if there is a real benefit in splitting my memory in two for the datanode/workers. Datanodes and the OS need memory to do their work. I suppose there could be a loss of performance if they came to compete for memory with the worker(s). Any opinion? :-)

org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Koert Kuipers
Just a heads up: I was doing some basic coding using DataFrame, Row, StructType, etc., and I ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL code. The issue went away when I changed my tests to run consecutively...

Re: SQL vs. DataFrame API

2015-06-23 Thread Bob Corsaro
I've only tried it in python. On Tue, Jun 23, 2015 at 12:16 PM Ignacio Blasco elnopin...@gmail.com wrote: That issue happens only in python dsl? On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote: Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On

Re: Help optimising Spark SQL query

2015-06-23 Thread Sabarish Sasidharan
64GB in parquet could be many billions of rows because of the columnar compression. And count distinct by itself is an expensive operation. This is not just on Spark, even on Presto/Impala, you would see performance dip with count distincts. And the cluster is not that powerful either. The one

Limitations using SparkContext

2015-06-23 Thread daunnc
The situation is as follows: I have a Spray server with a Spark context available (fair scheduling, in cluster mode, via spark-submit). There are some HTTP URLs that call Spark RDD operations, collecting information from Accumulo / HDFS / etc. (using RDDs). I noticed that there is a sort of

Re: Limitations using SparkContext

2015-06-23 Thread Richard Marscher
Hi, can you detail the symptom further? Was it that only 12 requests were serviced and the other 440 timed out? I don't think that Spark is well suited for this kind of workload, or at least the way it is being represented. How long does a single request take Spark to complete? Even with fair

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote: just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark

Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external metastore service which runs Hive with CDH 4.6. In Spark 1.2, I was able to successfully connect by building with the following: ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I see that in

Re: Registering custom metrics

2015-06-23 Thread Otis Gospodnetić
Hi, Not sure if this will fit your needs, but if you are trying to collect+chart some metrics specific to your app, yet want to correlate them with what's going on in Spark, maybe Spark's performance numbers, you may want to send your custom metrics to SPM, so they can be

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation, and the presence of multiple versions of Spark may throw off the classloaders in weird ways. So make the assembly marking those dependencies as scope=provided. On Tue, Jun 23,
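
A build.sbt sketch of this advice (versions illustrative): Spark's own artifacts are marked provided so they stay out of the assembly, while the Kafka connector, which is not part of the installation, is bundled:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0" // goes into the assembly
    )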

When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread commtech
Hi, I work at a large financial institution in New York. We're looking into Spark and trying to learn more about the deployment/use cases for real-time analytics with Spark. When would it be better to deploy standalone Spark versus Spark on top of a more comprehensive data management layer

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Tathagata Das
Aaah, this could potentially be a major issue, as it may prevent metrics from a restarted streaming context from being published. Can you file a JIRA? TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I'm running a program in Spark 1.4 where

spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Hi, while using Spark Streaming (1.2) with Kafka, I am getting the below error; receivers are getting killed but jobs get scheduled at each stream interval. 15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82, ip(XX)): java.io.IOException: Failed to connect to

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static state is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora nipunarora2...@gmail.com wrote: I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do, you get a null
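
A sketch of the working pattern, assuming a DStream[String] named stream and a hypothetical loadModel(): create the broadcast once on the driver and let the closure capture it, instead of stashing it in a static field (which stays null on the executors):

    val model: Map[String, Double] = loadModel()     // hypothetical loader
    val modelBc = ssc.sparkContext.broadcast(model)  // created once, on the driver
    val scored = stream.map(w => modelBc.value.getOrElse(w, 0.0)) // closure captures the broadcast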

Re: Spark standalone cluster - resource management

2015-06-23 Thread Igor Berman
Probably there are already running jobs there. In addition, memory is also a resource, so if you are running one application that took all your memory and then try to run another application that asks for memory the cluster doesn't have, the second app won't run. So why are you

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give an accurate count for each batch, though for getting the rate you have to make sure that your streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the Spark Streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu
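
A sketch of a per-batch count, assuming inputStream is a DStream; count() yields a DStream whose single-element RDDs carry each batch's record count:

    inputStream.count().foreachRDD { (rdd, time) =>
      // each RDD holds exactly one Long: the number of records in this batch
      rdd.collect().foreach(c => println(s"batch $time: $c records"))
    }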

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora shushantaror...@gmail.com wrote: It throws exception for WriteAheadLogUtils after excluding core and streaming jar. Exception in thread

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-23 Thread Guillaume Pitel
Hi, so I've done this node-centered accumulator, and I've written a small piece about it: http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/ Hope it can help someone. Guillaume 2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com

Re: How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Richard Marscher
There should be no difference, assuming you don't use the intermediate stored RDD values you are creating for anything else (rdd1, rdd2). In the first example it is still creating these intermediate RDD objects; you are just using them implicitly and not storing the value. It's also worth
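
A small illustration (f, g, h are hypothetical functions): both forms build the same lineage, and nothing runs until an action is called:

    val chained = rdd.map(f).map(g).map(h)

    val rdd1 = rdd.map(f)
    val rdd2 = rdd1.map(g)
    val rdd3 = rdd2.map(h)

    // chained.collect() and rdd3.collect() trigger identical work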

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
It throws exception for WriteAheadLogUtils after excluding core and streaming jar. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/streaming/util/WriteAheadLogUtils$ at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84) at

Kafka createDirectStream ​issue

2015-06-23 Thread syepes
Hello, I am trying to use the new Kafka consumer KafkaUtils.createDirectStream but I am having some issues making it work. I have tried different versions of Spark, v1.4.0 and branch-1.4 #8d6e363, and I am still getting the same strange exception ClassNotFoundException: $line49.$read$$iwC$$i

Re: Spark standalone cluster - resource management

2015-06-23 Thread Nizan Grauer
I have 30G per machine. This is the first (and only) job I'm trying to submit, so it's weird that for --total-executor-cores=20 it works, and for --total-executor-cores=4 it doesn't. On Tue, Jun 23, 2015 at 10:46 PM, Igor Berman igor.ber...@gmail.com wrote: probably there are already running

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Thanks a lot. It worked after keeping all versions the same: 1.2.0. On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das t...@databricks.com wrote: Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora

kafka spark streaming with mesos

2015-06-23 Thread Bartek Radziszewski
Hey, I'm trying to run Kafka Spark Streaming using Mesos with the following example: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.StringDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Cody Koeninger
The exception $line49 is referring to a line of the spark shell. Have you tried it from an actual assembled job with spark-submit? On Tue, Jun 23, 2015 at 3:48 PM, syepes sye...@gmail.com wrote: Hello, I am trying to use the new Kafka consumer KafkaUtils.createDirectStream but I am

How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Ashish Soni
Hi all, what is the difference between the forms below, in terms of execution on a cluster with 1 or more worker nodes? rdd.map(...).map(...)...map(..) vs val rdd1 = rdd.map(...) val rdd2 = rdd1.map(...) val rdd3 = rdd2.map(...) Thanks, Ashish

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
It seems that it doesn't happen in Scala API. Not exactly the same as in python, but pretty close. https://gist.github.com/elnopintan/675968d2e4be68958df8 2015-06-23 23:11 GMT+02:00 Davies Liu dav...@databricks.com: I think it also happens in DataFrames API of all languages. On Tue, Jun 23,

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
I think it also happens in the DataFrames API of all languages. On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco elnopin...@gmail.com wrote: That issue happens only in python dsl? On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote: Thanks! The solution:

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I cannot. I've already limited the number of cores to 10, so it gets 5 executors with 2 cores each... On Tue, 23.06.2015 at 13:45, Akhil Das ak...@sigmoidanalytics.com wrote: Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks

flume sinks supported by spark streaming

2015-06-23 Thread Hafiz Mujadid
Hi! I want to integrate Flume with Spark Streaming and want to know which Flume sink types are supported by Spark Streaming. I saw one example using an Avro sink. Thanks

Re: when cached RDD will unpersist its data

2015-06-23 Thread eric wong
In the case that memory cannot hold all the cached RDDs, the BlockManager will evict older blocks to make room for new RDD blocks. Hope that helps. 2015-06-24 13:22 GMT+08:00 bit1...@163.com bit1...@163.com: I am kind of confused about when a cached RDD will unpersist its data. I know we
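
A sketch of the explicit versus automatic cases (data and transform names are hypothetical):

    import org.apache.spark.storage.StorageLevel

    val cached = data.map(expensiveTransform).persist(StorageLevel.MEMORY_ONLY)
    cached.count()     // first action materializes the cache
    // ... reuse cached several times ...
    cached.unpersist() // explicit release; otherwise older blocks are evicted LRU under memory pressure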

when cached RDD will unpersist its data

2015-06-23 Thread bit1...@163.com
I am kind of confused about when a cached RDD will unpersist its data. I know we can explicitly unpersist it with RDD.unpersist, but can it be unpersisted automatically by the Spark framework? Thanks. bit1...@163.com

Re: Spark standalone cluster - resource management

2015-06-23 Thread nizang
To give a bit more detail on what I'm trying to get: I have many tasks I want to run in parallel, so I want each task to take only a small part of the cluster (a limited part of my 20 cores). I have important tasks that I want to get 10 cores, and I have small tasks that I want

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Murthy Chelankuri
I am invoking it from the java application by creating the sparkcontext On Tue, Jun 23, 2015 at 12:17 PM, Tathagata Das t...@databricks.com wrote: How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com wrote: Yes I have the producer in the class path. And I am using in standalone mode. Sent from my iPhone On 23-Jun-2015, at 3:31 am, Tathagata

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
So you have Kafka in your classpath in your Java application, where you are creating the SparkContext with the Spark standalone master URL, right? The recommended way of submitting Spark applications to any cluster is using spark-submit. See

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
Queue stream does not support driver checkpoint recovery, since the RDDs in the queue are arbitrarily generated by the user and it's hard for Spark Streaming to keep track of the data in the RDDs (that's necessary for recovering from a checkpoint). Anyway, queue stream is meant for testing and

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla anshushuk...@gmail.com wrote: Thanks for the reply!! Yes, it should write on any machine of the cluster. Can you please help me with how to do this? Previously I was

MLLIB - Storing the Trained Model

2015-06-23 Thread samsudhin
Hi all, I was trying to store a trained model to the local hard disk. I am able to save it using the save() function, but when I try to retrieve the stored model using the load() function I end up with the following error. Kindly help me with this. scala> val sameModel =

Spark standalone cluster - resource management

2015-06-23 Thread nizang
hi, I'm running spark standalone cluster with 5 slaves, each has 4 cores. When I run job with the following configuration: /root/spark/bin/spark-submit -v --total-executor-cores 20 --executor-memory 22g --executor-cores 4 --class com.windward.spark.apps.MyApp --name dev-app

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread Tathagata Das
Maybe this is a known issue with spark streaming and master web ui. Disable event logging, and it should be fine. https://issues.apache.org/jira/browse/SPARK-6270 On Mon, Jun 22, 2015 at 8:54 AM, scar scar scar0...@gmail.com wrote: Sorry I was on vacation for a few days. Yes, it is on. This is

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Murthy Chelankuri
Yes, in Spark standalone mode with the master URL. Jars are copied to the executors and the application runs fine, but it fails at some point when Kafka tries to load the Encoder and Partitioner classes using some reflection mechanisms. Here are my findings so

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
Maybe while producing the messages, you can make it a KeyedMessage with the timestamp as the key, and on the consumer end you can easily identify the key (which will be the timestamp) from the message. If the network is fast enough, then I think there would only be a small millisecond lag. Thanks Best
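
A sketch with the Kafka 0.8 Scala producer, assuming a configured Producer[String, String] and a hypothetical topic name:

    import kafka.producer.KeyedMessage

    // send time as the message key, payload as the value
    val msg = new KeyedMessage[String, String]("events", System.currentTimeMillis.toString, payload)
    producer.send(msg)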

How to figure out how many records received by individual receiver

2015-06-23 Thread bit1...@163.com
Hi, I am using Spark 1.3.1 and have 2 receivers. On the web UI, I can only see the total records received by these 2 receivers together, but I can't figure out the records received by each individual receiver. Not sure whether that information is shown on the UI in Spark 1.4. bit1...@163.com

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
This could be because of some subtle change in the classloaders used by executors. I think there has been issues in the past with libraries that use Class.forName to find classes by reflection. Because the executors load classes dynamically using custom class loaders, libraries that use

RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin, If possible, can you please provide a part of the code? Or perhaps try with the unit test in RandomForestSuite to see if the issue repros. Regards, yuhao -Original Message- From: samsudhin [mailto:samsud...@pigstick.com] Sent: Tuesday, June 23, 2015 2:14 PM To:

Re: What does [Stage 0: (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (the Stage information) is an ASCII representation of the WebUI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are left with 2 threads to process ((0 + 2) -- this 2 is your 2 threads.) And the

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile("/sigmoid", DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote: Hello, everyone! I'm new in spark.

What does [Stage 0: (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, I have a spark streaming application that runs locally with two receivers; some code snippets follow: conf.setMaster("local[4]") //RPC Log Streaming val rpcStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, consumerParams, topicRPC,

Re: Re: What does [Stage 0: (0 + 2) / 2] mean on the console

2015-06-23 Thread bit1...@163.com
Hi, Akhil, Thank you for the explanation! bit1...@163.com From: Akhil Das Date: 2015-06-23 16:29 To: bit1...@163.com CC: user Subject: Re: What does [Stage 0: (0 + 2) / 2] mean on the console Well, you could that (Stage information) is an ASCII representation of the WebUI (running on port

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-23 Thread Matthew Johnson
Awesome, thanks Pawan – for now I’ll give spark-notebook a go until Zeppelin catches up to Spark 1.4 (and when Zeppelin has a binary release – my PC doesn’t seem too happy about building a Node.js app from source). Thanks for the detailed instructions!! *From:* pawan kumar

Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you can see, the nodes don't have much RAM. I have 2 streaming apps; the first one is configured to use 3GB of memory per node and the second one uses 2GB per node. My problem is that the smaller app could easily run

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com wrote: I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you
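
A sketch of capping one app's share of the cluster (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app-1")
      .set("spark.cores.max", "10")       // total cores this app may take from the cluster
      .set("spark.executor.memory", "2g")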

Calculating tuple count /input rate with time

2015-06-23 Thread anshu shukla
I am calculating the input rate using the following logic, and I think this foreachRDD is always running on the driver (printlns are seen on the driver). 1. Is there any other way to do this at a lower cost? 2. Will this give me the correct count for the rate? //code - inputStream.foreachRDD(new

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread anshu shukla
Thanks a lot! Because I just want to log the timestamp and a unique message id, and not the full RDD. On Tue, Jun 23, 2015 at 12:41 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla

Re: Velox Model Server

2015-06-23 Thread Sean Owen
Yes, and typically needs are 100ms. Now imagine even 10 concurrent requests. My experience has been that this approach won't nearly scale. The best you could probably do is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
Thanks for the suggestions everyone, appreciate the advice. I tried replacing DISTINCT for the nested GROUP BY, running on 1.4 instead of 1.3, replacing the date casts with a between operation on the corresponding long constants instead and changing COUNT(*) to COUNT(1). None of these seem to

Re: s3 - Can't make directory for path

2015-06-23 Thread Steve Loughran
On 23 Jun 2015, at 00:09, Danny kont...@dannylinden.de wrote: hi, have you tested s3://ww-sandbox/name_of_path/ instead of s3://ww-sandbox/name_of_path + make sure the bucket is there already; Hadoop s3 clients don't currently handle that step or have you tested adding your file

RE: Web UI vs History Server Bugs

2015-06-23 Thread Evo Eftimov
Probably your application has crashed or was terminated without invoking the stop method of the Spark context - in such cases it doesn't create the empty flag file which apparently tells the history server that it can safely show the log data - simply go to some of the other dirs of the history server

Re: flume sinks supported by spark streaming

2015-06-23 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi! I want to integrate flume with spark streaming. I want to know which sink

Re: Spark standalone cluster - resource management

2015-06-23 Thread canan chen
Check the available resources you have (CPU cores, memory) on the master web UI. The log you see means the job can't get any resources. On Wed, Jun 24, 2015 at 5:03 AM, Nizan Grauer ni...@windward.eu wrote: I'm having 30G per machine This is the first (and only) job I'm trying to submit. So

Re: map V mapPartitions

2015-06-23 Thread canan chen
One example is that you'd like to set up a JDBC connection for each partition and share this connection across the records. mapPartitions is much more like the paradigm of a mapper in MapReduce: in a MapReduce mapper, you have a setup method to do any initialization stuff before processing the

Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is the correct question. Spark can be deployed on different cluster manager frameworks like standalone, YARN, and Mesos. Spark can't run without these cluster manager frameworks, which means Spark depends on a cluster manager framework. And the data management layer is the upstream

Re: Spark launching without all of the requested YARN resources

2015-06-23 Thread canan chen
Why do you want it to wait until all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources. On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra arun.lut...@gmail.com wrote: Sometimes if my Hortonworks yarn-enabled cluster

Re: Yarn application ID for Spark job on Yarn

2015-06-23 Thread canan chen
I don't think there is YARN-related stuff to access in Spark; Spark doesn't depend on YARN. BTW, why do you want the YARN application ID? On Mon, Jun 22, 2015 at 11:45 PM, roy rp...@njit.edu wrote: Hi, Is there a way to get the YARN application ID inside a Spark application, when running spark

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Tathagata Das
Please run it in your own application and not in the spark shell. I see that you are trying to stop the Spark context and create a new StreamingContext. That will lead to unexpected issues, like the one you are seeing. Please make a standalone SBT/Maven app for Spark Streaming. On Tue, Jun 23, 2015 at

Re: SQL vs. DataFrame API

2015-06-23 Thread Davies Liu
If you change it to ```val numbers2 = numbers```, then it has the same problem. On Tue, Jun 23, 2015 at 2:54 PM, Ignacio Blasco elnopin...@gmail.com wrote: It seems that it doesn't happen in the Scala API. Not exactly the same as in python, but pretty close.

Re: map V mapPartitions

2015-06-23 Thread Holden Karau
I think one of the primary cases where mapPartitions is useful is if you are going to be doing any setup work that can be re-used between processing each element; this way the setup work only needs to be done once per partition (for example, creating an instance of jodatime). Both map and
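
A sketch of that setup-once-per-partition pattern, with a hypothetical JDBC lookup (the URL and query are invented for illustration):

    import java.sql.DriverManager

    val enriched = rdd.mapPartitions { iter =>
      // one connection per partition, not one per record
      val conn = DriverManager.getConnection("jdbc:postgresql://host/db")
      val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")
      val out = iter.map { id =>
        stmt.setInt(1, id)
        val rs = stmt.executeQuery()
        (id, if (rs.next()) rs.getString(1) else "unknown")
      }.toList          // force evaluation before closing the connection
      conn.close()
      out.iterator
    }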

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
A rough estimate of the worst case memory requirement for driver is about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put 2 at the beginning because the previous centers are still in memory while receiving new center updates. -Xiangrui On Fri, Jun 19, 2015 at 9:02 AM, Rogers Jeffrey
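
A worked instance of that estimate, with hypothetical settings:

    // 2 * k * runs * numFeatures * numPartitions * 8 bytes
    val k = 500; val runs = 1; val numFeatures = 1000; val numPartitions = 50
    val bytes = 2L * k * runs * numFeatures * numPartitions * 8 // 400,000,000 B, roughly 400 MB on the driver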

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-23 Thread Xiangrui Meng
This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct. We can still follow FP-Growth's steps: 1. find frequent items 2. filter transactions and keep only frequent items 3. do NOT order by frequency 4. use suffix to partition the transactions

Re: Issue running Spark 1.4 on Yarn

2015-06-23 Thread Matt Kapilevich
Hi Kevin I never did. I checked for free space in the root partition, don't think this was an issue. Now that 1.4 is officially out I'll probably give it another shot. On Jun 22, 2015 4:28 PM, Kevin Markey kevin.mar...@oracle.com wrote: Matt: Did you ever resolve this issue? When running on a

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-23 Thread Xiangrui Meng
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your dataset? 2. If you monitor the progress in the WebUI,

Re: How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-23 Thread Xiangrui Meng
Please check the input path to your test data, and call `.count()` and see whether there are records in it. -Xiangrui On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue yue.yuany...@gmail.com wrote: Hey, I am testing the StreamingLinearRegressionWithSGD following the tutorial. It works, but I could

Re: which mllib algorithm for large multi-class classification?

2015-06-23 Thread Xiangrui Meng
We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle it efficiently, we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an array larger than MaxInt

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Roberto Congiu
I wrote a brief howto on building nested records in spark and storing them in parquet here: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/ 2015-06-23 16:12 GMT-07:00 Richard Catlin richard.m.cat...@gmail.com: How do I create a DataFrame(SchemaRDD) with a nested array of Rows
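
A minimal sketch of building such a nested DataFrame programmatically (Spark 1.4 API; sc and sqlContext assumed to exist; schema and output path are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // each record holds an array of (page, seconds) structs
    val visitType = StructType(Seq(
      StructField("page", StringType), StructField("seconds", IntegerType)))
    val schema = StructType(Seq(
      StructField("user", StringType),
      StructField("visits", ArrayType(visitType))))

    val rows = sc.parallelize(Seq(Row("u1", Seq(Row("a", 10), Row("b", 3)))))
    val df = sqlContext.createDataFrame(rows, schema)
    df.write.parquet("/tmp/nested") // persists as nested Parquet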

Re: mutable vs. pure functional implementation - StatCounter

2015-06-23 Thread Xiangrui Meng
Creating millions of temporary (immutable) objects is bad for performance. It should be simple to do a micro-benchmark locally. -Xiangrui On Mon, Jun 22, 2015 at 7:25 PM, mzeltser mzelt...@gmail.com wrote: Using StatCounter as an example, I'd like to understand if pure functional implementation

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Hi DB Tsai, Thanks for your reply. I went through the source code of LinearRegression.scala. The algorithm minimizes the squared error L = 1/(2n) * ||A * weights - y||^2. I cannot match this with the elastic net loss function found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which is the

Re: Kafka createDirectStream ​issue

2015-06-23 Thread syepes
Yes, I have two clusters: one standalone and another using Mesos. Sebastian YEPES http://sebastian-yepes.com On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List] ml-node+s1001560n23457...@n3.nabble.com wrote: Hi syepes, Are you running the application in standalone mode? Regards

RE: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Richard Catlin
How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a column? Is there an example? Will this store as a nested parquet file? Thanks. Richard Catlin

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread Wei Zhou
Thanks DB Tsai, it is very helpful. Cheers, Wei 2015-06-23 16:00 GMT-07:00 DB Tsai dbt...@dbtsai.com: Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

[no subject]

2015-06-23 Thread ๏̯͡๏
I have a Spark job that has 7 stages. The first 3 stages complete and the fourth stage begins (joins two RDDs). This stage has multiple task failures, all with the below exception. Multiple tasks (100s of them) get the same exception on different hosts. How can all the hosts suddenly stop responding

Re: Kafka createDirectStream ​issue

2015-06-23 Thread drarse
Hi syepes, Are you running the application in standalone mode? Regards On 23/06/2015 22:48, syepes [via Apache Spark User List] ml-node+s1001560n23456...@n3.nabble.com wrote: Hello, I am trying to use the new Kafka consumer KafkaUtils.createDirectStream but I am having some issues making it

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
The regularization is handled after the objective function of data is computed. See https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala line 348 for L2. For L1, it's handled by OWLQN, so you

map V mapPartitions

2015-06-23 Thread ๏̯͡๏
I know when to use map(), but when should I use mapPartitions()? Which is faster? -- Deepak
