Spark Kafka stream processing time increasing gradually

2016-06-12 Thread Roshan Singh
Hi all, I have a Python streaming job which is supposed to run 24x7. I am unable to stabilize it. The job just counts the number of links shared in a 30-minute sliding window. I am using the reduceByKeyAndWindow operation with a batch interval of 30 seconds and a slide interval of 60 seconds. The Kafka queue has a rate of
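A minimal PySpark sketch of the usual fix for gradually growing batch times: pass the inverse reduce function to reduceByKeyAndWindow so Spark subtracts the batches sliding out of the window instead of re-reducing the full 30 minutes on every slide. The socket source and checkpoint path below are placeholders standing in for the Kafka stream:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="link-counts")
    ssc = StreamingContext(sc, 30)                 # 30-second batches
    ssc.checkpoint("/tmp/link-counts-checkpoint")  # required when an inverse function is used

    # Placeholder source: one shared link per line; a Kafka direct stream
    # would feed the same (link, 1) pairs.
    links = ssc.socketTextStream("localhost", 9999).map(lambda link: (link, 1))

    counts = links.reduceByKeyAndWindow(
        lambda a, b: a + b,   # fold in values entering the window
        lambda a, b: a - b,   # subtract values leaving the window
        windowDuration=1800,  # 30-minute window
        slideDuration=60)     # slide every 60 seconds

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()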

Spark Thrift Server in CDH 5.3

2016-06-12 Thread pooja mehta
Hi, How do I start the Spark Thrift Server with Cloudera CDH 5.3? Thanks.

RE: Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: 12 June 2016 16:40 To: user@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts: Long = 0L d

Re: What is the interpretation of Cores in Spark doc

2016-06-12 Thread Daniel Darabos
Spark is a software product. In software a "core" is something that a process can run on. So it's a "virtual core". (Do not call these "threads". A "thread" is not something a process can run on.) local[*] uses java.lang.Runtime.availableProcessors()
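A quick PySpark illustration of that point: local[*] sizes the local scheduler to the JVM's reported virtual-core count, which is reflected in defaultParallelism.

    from pyspark import SparkConf, SparkContext

    # local[*] asks for one worker thread per virtual core, i.e. whatever
    # java.lang.Runtime.availableProcessors() reports.
    conf = SparkConf().setMaster("local[*]").setAppName("core-count-demo")
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)  # number of cores the local master detected
    sc.stop()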

What is the interpretation of Cores in Spark doc

2016-06-12 Thread Mich Talebzadeh
Hi, I was writing some docs on Spark P&T and came across this. It is about the terminology, or the interpretation of it, in the Spark docs. This is my understanding of cores and threads. Cores are physical cores. Threads are virtual cores. A core with 2 threads uses hyper-threading technology, so 2

Re: Book for Machine Learning (MLlib and other libraries on Spark)

2016-06-12 Thread Chris Fregly
two of my faves: https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ (Cloudera authors) https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ (IBM author) (most) authors are Spark Committers. while not totally up to date w/ ML pipelines an

Re: Spark Getting data from MongoDB in JAVA

2016-06-12 Thread Ted Yu
What's the value of spark.version? Do you know which version of Spark the mongodb connector 0.10.3 was built against? You can use the following command to find out: mvn dependency:tree Maybe the Spark version you use is different from the one the mongodb connector was built against. On Fri, Jun 10, 2016

Re: Spark Getting data from MongoDB in JAVA

2016-06-12 Thread vaquar khan
Hi Asfandyar, *NoSuchMethodError* in Java means you compiled against one version of code and executed against a different version. Please make sure the dependency versions you add match the versions you compile and run against. regards, vaquar khan On Fri, Jun 10, 2016 at 4:50 AM, Asfandyar A

Re: Questions about Spark Worker

2016-06-12 Thread vaquar khan
Agreed with Mich. The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. *spark.driver.host:* Hostname or IP address for the driver to listen on. This is used for communicating with the executors and the standalone Ma
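A small PySpark sketch of pinning that setting (the address is a placeholder for the driver machine's wired IP):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("driver-host-demo")
            .set("spark.driver.host", "192.168.1.10"))  # hypothetical Ethernet IP

    sc = SparkContext(conf=conf)
    # The effective value also shows up under the Environment tab in the web UI.
    print(sc.getConf().get("spark.driver.host"))
    sc.stop()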

Re: Running Spark in Standalone or local modes

2016-06-12 Thread Ashok Kumar
Thanks Mich. Great explanation. On Saturday, 11 June 2016, 22:35, Mich Talebzadeh wrote: Hi Gavin, I believe in standalone mode a simple cluster manager is included with Spark that makes it easy to set up a cluster. It does not rely on YARN or Mesos. In summary this is from my notes:

Re: OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread vaquar khan
Hi Sharad. The array size you (or the serializer) are trying to allocate is just too big for the JVM. You can also split your input further by increasing parallelism. The following is a good explanation: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit regards, Vaquar khan On Sun
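A sketch of that parallelism knob, assuming the mllib Word2Vec API and a placeholder corpus path:

    from pyspark import SparkContext
    from pyspark.mllib.feature import Word2Vec

    sc = SparkContext(appName="word2vec-demo")

    # Hypothetical input: one sentence per line, whitespace-tokenized.
    corpus = sc.textFile("hdfs:///data/corpus.txt").map(lambda line: line.split(" "))

    # Splitting the input across more partitions, as suggested above,
    # keeps individual allocations smaller.
    model = Word2Vec().setVectorSize(100).setNumPartitions(50).fit(corpus)
    model.save(sc, "hdfs:///models/word2vec")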

Re: Questions about Spark Worker

2016-06-12 Thread Mich Talebzadeh
Hi, You basically want to use wired/Ethernet connections as opposed to wireless? In your Spark Web UI, under the Environment tab, what do you get for "spark.driver.host"? Also, can you cat /etc/hosts and send the output please, along with the output from ifconfig -a. HTH Dr Mich Talebzadeh LinkedIn *

Questions about Spark Worker

2016-06-12 Thread East Evil
Hi, guys. My question is about the Spark Worker IP address. I have four nodes; each node has a wireless module and an Ethernet module, so all nodes have two IP addresses. When I visit the web UI, the information is always displayed with the wireless IP address, but my Spark computing cluster is based on Ethernet. I

Re: Book for Machine Learning (MLlib and other libraries on Spark)

2016-06-12 Thread Deepak Goel
Thank You... Please see inline... On Sun, Jun 12, 2016 at 3:39 PM, wrote: > Machine learning - I would suggest that you pick up a fine book that > explains machine learning. That's the way I went about - pick up each type > of machine learning concept - say Linear regression then understand the >

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-12 Thread Mohit Jaggi
Looks like a bug in the code generating the SQL query… why it would be specific to SAS, I can’t guess. Did you try the same with another database? As a workaround you can write the select statement yourself instead of just providing the table name. > On Jun 11, 2016, at 6:27 PM, Ajay Chander wr
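A sketch of that workaround via the DataFrame JDBC reader: pass a hand-written SELECT as a derived table instead of a bare table name, so Spark does not generate the query itself. The URL, driver class, and column names below are placeholders, not SAS-specific values:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="jdbc-subquery-demo")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/mydb")          # hypothetical JDBC URL
          .option("driver", "com.mysql.jdbc.Driver")               # hypothetical driver class
          .option("dbtable", "(SELECT id, name FROM my_table) t")  # hand-written SELECT
          .load())

    df.show()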

Re: Book for Machine Learning (MLlib and other libraries on Spark)

2016-06-12 Thread mylisttech
Machine learning - I would suggest that you pick up a fine book that explains machine learning. That's the way I went about it - pick up each machine learning concept - say, linear regression - then understand the why/when/how, etc., and infer results. Then apply the learning to a small dat

OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread sharad82
Trying to save the Word2Vec model trained over 10G of data leads to the OOM error below.

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Spark Version: 1.6
spark.dynamicAllocation.enable false
spark.executor.memory 75g
spark.driver.memory 150g
spark.driver.cores 10

Several questions about how pyspark.ml works

2016-06-12 Thread XapaJIaMnu
Hey, I have some additional Spark ML algorithms implemented in Scala that I would like to make available in pyspark. For reference I am looking at the available logistic regression implementation here: https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/ml/classification.html I hav
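For what it's worth, the 1.6 pattern the built-in pyspark.ml wrappers follow is roughly the sketch below; the class and JVM names are hypothetical stand-ins for your own Scala Estimator:

    from pyspark.ml.wrapper import JavaEstimator, JavaModel
    from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol

    class MyAlgorithm(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol):
        """Hypothetical wrapper for a Scala Estimator org.example.ml.MyAlgorithm."""

        def __init__(self):
            super(MyAlgorithm, self).__init__()
            # Instantiate the JVM-side Estimator via Py4J.
            self._java_obj = self._new_java_obj("org.example.ml.MyAlgorithm", self.uid)

        def _create_model(self, java_model):
            # Called by JavaEstimator.fit() with the fitted JVM model.
            return MyAlgorithmModel(java_model)

    class MyAlgorithmModel(JavaModel):
        """Python handle for the fitted JVM-side model."""
        pass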

How to use a model generated early in a stage in ML pipelines

2016-06-12 Thread Hayri Volkan Agun
Hi, I have a pipeline for classification. However, before classification I want to use a model generated in an earlier stage. How can I get a reference to this model to use as an input to another stage? Where are the model references generated in the pipeline held? How can I get the model by uid, etc.? Consi
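A minimal PySpark sketch: after fit(), PipelineModel exposes the fitted stages in order via .stages, so an earlier stage's model can be retrieved by position or by matching uid (the tiny training DataFrame here is made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    sc = SparkContext(appName="pipeline-stages-demo")
    sqlContext = SQLContext(sc)
    training = sqlContext.createDataFrame(
        [("a b c", 1.0), ("d e", 0.0)], ["text", "label"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)

    fitted_lr = model.stages[2]                               # by position
    by_uid = [s for s in model.stages if s.uid == lr.uid][0]  # by uid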

Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
I have a Spark application whose structure is below:

    var ts: Long = 0L
    dstream1.foreachRDD { (x, time) => {
      ts = time
      x.do_something()...
    }}
    ..
    process_data(dstream2, ts, ..)

I assume foreachRDD function call can up