Spark Streaming over a REST API

2015-05-18 Thread juandasgandaras
Hello, I would like to use Spark Streaming over a REST API to get information over time and with different parameters in the REST query. I was thinking of using Apache Kafka, but I don't have any experience with this and I would like some advice about it. Thanks. Best regards,

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
we = Sigmoid; back-pressuring mechanism = stopping the receiver from receiving more messages when it's about to exhaust the worker memory. Here's a similar https://issues.apache.org/jira/browse/SPARK-7398 kind of proposal, if you haven't seen it already. Thanks Best Regards On Mon, May 18, 2015 at

Re: Forbidden : Error Code: 403

2015-05-18 Thread Mohammad Tariq
Tried almost all the options, but it did not work. So, I ended up creating a new IAM user and the keys of this user are working fine. I am not getting the Forbidden (403) exception now, but my program seems to be running infinitely. It's not throwing any exception, it just keeps running

Re: Processing multiple columns in parallel

2015-05-18 Thread ayan guha
My first thought would be creating 10 RDDs and running your word count on each of them. I think the Spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs. Best Ayan On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, Consider I have a tab delimited text

pass configuration parameters to PySpark job

2015-05-18 Thread Oleg Ruchovets
Hi, I am looking for a way to pass configuration parameters to a Spark job. In general I have a quite simple PySpark job. def process_model(k, vc): do something sc = SparkContext(appName=TAD) lines = sc.textFile(input_job_files) result =

parsedData option

2015-05-18 Thread Ricardo Goncalves da Silva
Hi Team, My dataset has the following format: CELLPHONE,KL_1,KL_2,KL_3,KL_4,KL_5 1120100114,-5.3244062521117e-003,-4.10825709805041e-003,-1.7816995027779e-002,-4.21462029980323e-003,-1.6200555039e-002 i.e., an identifier in the first column and the data separated by commas. To load this data I'm
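For readers following the thread, here is a minimal PySpark sketch of how a file in this shape is often parsed (the file path is made up, and the header handling assumes the first line contains the column names):

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="ParseKLData")

raw = sc.textFile("hdfs:///path/to/kl_data.csv")   # hypothetical path
header = raw.first()                               # "CELLPHONE,KL_1,...,KL_5"
data = raw.filter(lambda line: line != header)

# Keep the identifier, turn the remaining comma-separated fields into a dense vector
parsed = data.map(lambda line: line.split(",")) \
             .map(lambda p: (p[0], Vectors.dense([float(x) for x in p[1:]])))

print(parsed.take(1))
```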

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
Who are “we” and what is the mysterious “back-pressuring mechanism”, and is it part of the Spark distribution (are you talking about an implementation of the custom feedback loop mentioned in my previous emails below)? I am asking because I can assure you that at least as of Spark Streaming 1.2.0,

Working with slides. How do I know how many times an RDD has been processed?

2015-05-18 Thread Guillermo Ortiz
Hi, I have two streams, RDD1 and RDD2, and I want to cogroup them. The data don't arrive at the same time and sometimes they come with some delay. When I have all the data I want to insert it into MongoDB. For example, imagine that I get: RDD1 -- T 0, RDD2 -- T 0.5. I do a cogroup between them but I couldn't

Re: Spark and Flink

2015-05-18 Thread Robert Metzger
Hi, I would really recommend you to put your Flink and Spark dependencies into different maven modules. Having them both in the same project will be very hard, if not impossible. Both projects depend on similar projects with slightly different versions. I would suggest a maven module structure

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
We fix the rate at which the receivers should consume at any given point in time. Also, we have a back-pressuring mechanism attached to the receivers so they won't simply crash in the unceremonious way Evo described. Mesos has some sort of auto-scaling (I read it somewhere), maybe you can look into

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
Ooow – that is essentially the custom feedback loop mentioned in my previous emails in generic architecture terms, and what you have done is only one of the possible implementations, moreover one based on ZooKeeper – there are other possible designs not using things like ZooKeeper at all and hence

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread MEETHU MATHEW
Hi Akhil, The Python wrapper for Spark Job Server did not help me. What I actually need is a PySpark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread ayan guha
Hi, so to be clear: do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the Python thread module can't be used. Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in

How to debug spark in IntelliJ Idea

2015-05-18 Thread Yi.Zhang
Hi all, I wrote some code to access the Spark master, which was deployed in standalone mode. I wanted to set a breakpoint in the Spark master, which was running in a different process. I am wondering whether I need to attach to the process in IntelliJ, so that when AppClient sends the message to

py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-18 Thread Shay Rojansky
I'm having issues with submitting a Spark Yarn job in cluster mode when the cluster filesystem is file:///. It seems that additional resources (--py-files) are simply being skipped and not being added into the PYTHONPATH. The same issue may also exist for --jars, --files, etc. We use a simple NFS

Re: number of executors

2015-05-18 Thread edward cui
Oh BTW, it's Spark 1.3.1 on Hadoop 2.4, AMI 3.6. Sorry for leaving out this information. I'd appreciate any help! Ed 2015-05-18 12:53 GMT-04:00 edward cui edwardcu...@gmail.com: I actually have the same problem, but I am not sure whether it is a Spark problem or a Yarn problem. I set up a

Re: Reading Real Time Data only from Kafka

2015-05-18 Thread Akhil Das
I have played a bit with the directStream Kafka API. Good work, Cody. These are my findings, and also can you clarify a few things for me (see below). - When auto.offset.reset = smallest and you have 60GB of messages in Kafka, it takes forever as it reads the whole 60GB at once. largest will only

org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-18 Thread zia_kayani
Hi, I'm getting this exception after shifting my code from Spark 1.2 to Spark 1.3 15/05/18 18:22:39 WARN TaskSetManager: Lost task 0.0 in stage 1.6 (TID 84, cloud8-server): FetchFailed(BlockManagerId(1, cloud4-server, 7337), shuffleId=0, mapId=9, reduceId=1, message=

Spark groupByKey, does it always create at least 1 partition per key?

2015-05-18 Thread tomboyle
I am currently using Spark Streaming. During my batch processing I must groupByKey. Afterwards I call foreachRDD / foreachPartition and write to an external datastore. My only concern is whether this is future-proof. I know groupByKey by default uses the HashPartitioner. I have printed out the

Re: py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-18 Thread Marcelo Vanzin
Hi Shay, Yeah, that seems to be a bug; it doesn't seem to be related to the default FS nor compareFs either - I can reproduce this with HDFS when copying files from the local fs too. In yarn-client mode things seem to work. Could you file a bug to track this? If you don't have a jira account I

RE: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
Thanks for the heads up mate. On 18 May 2015 19:08, Evo Eftimov evo.efti...@isecc.com wrote: Ooow – that is essentially the custom feedback loop mentioned in my previous emails in generic Architecture Terms and what you have done is only one of the possible implementations moreover based on

RE: Processing multiple columns in parallel

2015-05-18 Thread Needham, Guy
How about making the range in the for loop parallelised? The driver will then kick off the word counts independently. Regards, Guy Needham | Data Discovery Virgin Media | Technology and Transformation | Data Bartley Wood Business Park, Hook, Hampshire RG27 9UP D 01256 75 3362 I welcome VSRE

Re: number of executors

2015-05-18 Thread Sandy Ryza
*All On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote: Sorry, them both are assigned task

Re: number of executors

2015-05-18 Thread Sandy Ryza
Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote: Sorry, them both are assigned task actually. Aggregated Metrics by Executor Executor IDAddressTask TimeTotal TasksFailed

Re: number of executors

2015-05-18 Thread edward cui
I actually have the same problem, but I am not sure whether it is a Spark problem or a Yarn problem. I set up a five-node cluster on AWS EMR and start the yarn daemon on the master (the node manager will not be started by default on the master; I don't want to waste any resources since I have to pay).

Re: Spark Streaming over a REST API

2015-05-18 Thread Akhil Das
Why not use Spark Streaming to do the computation, dump the result somewhere (in a DB perhaps) and take it from there? Thanks Best Regards On Mon, May 18, 2015 at 7:51 PM, juandasgandaras juandasganda...@gmail.com wrote: Hello, I would like to use spark streaming over a REST api to get

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
My pleasure, young man. I will even go beyond the so-called heads up and send you a solution design for a feedback loop preventing Spark Streaming app clogging and resource depletion and featuring machine-learning-based self-tuning, AND which is not ZooKeeper based and hence offers lower latency

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-18 Thread Olivier Girardot
PR is opened: https://github.com/apache/spark/pull/6237 On Fri, 15 May 2015 at 17:55, Olivier Girardot ssab...@gmail.com wrote: yes, please do and send me the link. @rxin I have trouble building master, but the code is done... On Fri, 15 May 2015 at 01:27, Haopu Wang hw...@qilinsoft.com

Re: Error communicating with MapOutputTracker

2015-05-18 Thread Imran Rashid
On Fri, May 15, 2015 at 5:09 PM, Thomas Gerber thomas.ger...@radius.com wrote: Now, we noticed that we get java heap OOM exceptions on the output tracker when we have too many tasks. I wonder: 1. where does the map output tracker live? The driver? The master (when those are not the same)? 2.

Re: applications are still in progress?

2015-05-18 Thread Imran Rashid
Most likely, you never call sc.stop(). Note that in 1.4, this will happen for you automatically in a shutdown hook, taken care of by https://issues.apache.org/jira/browse/SPARK-3090 On Wed, May 13, 2015 at 8:04 AM, Yifan LI iamyifa...@gmail.com wrote: Hi, I have some applications

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread Davies Liu
SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example: import threading import time def show(x): time.sleep(1) print x def job(): sc.parallelize(range(100)).foreach(show) threading.Thread(target=job).start() On Mon, May 18,
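For readers skimming the digest, the flattened snippet above reads like this as a runnable sketch (Python 2 syntax, as in the original; the SparkContext creation is added here only to make it self-contained):

```python
import threading
import time

from pyspark import SparkContext

sc = SparkContext(appName="multi-threaded-jobs")

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

# Each thread submits an independent job to the shared SparkContext
threading.Thread(target=job).start()
threading.Thread(target=job).start()
```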

Re: parallelism on binary file

2015-05-18 Thread Imran Rashid
You can use sc.hadoopFile (or any of the variants) to do what you want. They even let you reuse your existing HadoopInputFormats. You should be able to mimic your old use with MR just fine. sc.textFile is just a convenience method which sits on top. imran On Fri, May 8, 2015 at 12:03 PM, tog

Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin, It sounds like you're on the right track. The best way to write a custom Evaluator will probably be to modify an existing Evaluator as you described. It's best if you don't remove the other code, which handles parameter set/get and schema validation. Joseph On Sun, May 17, 2015 at

TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I am trying to print a basic twitter stream and receiving the following error: 15/05/18 22:03:14 INFO Executor: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with timestamp 1432000973058 15/05/18 22:03:14 INFO Utils: Fetching

Re: number of executors

2015-05-18 Thread xiaohe lan
Yeah, I read that page before, but it does not mention that the options should come before the application jar. Actually, if I put the --class option before the application jar, I will get a ClassNotFoundException. Anyway, thanks again, Sandy. On Tue, May 19, 2015 at 11:06 AM, Sandy Ryza

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I think I found the answer - http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html Do I have no way of running this in Windows locally? On Mon, May 18, 2015 at 10:44 PM, Justin Pihony justin.pih...@gmail.com wrote:

RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for the response. But I could not see a fillna function in the DataFrame class. Is it available in some specific version of Spark SQL? This is what I have in my pom.xml: <dependency> <groupId>org.apache.spark</groupId>

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I'm not 100% sure that is causing the problem, though. The stream still starts, but is giving blank output. I checked the environment variables in the UI and it is running local[*], so there should be no bottleneck there. On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com

Re: number of executors

2015-05-18 Thread xiaohe lan
Hi Sandy, Thanks for your information. Yes, spark-submit --master yarn --num-executors 5 --executor-cores 4 target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is working awesomely. Is there any documentation pointing to this? Thanks, Xiaohe On Tue, May 19, 2015 at 12:07 AM,

Re: number of executors

2015-05-18 Thread Sandy Ryza
Awesome! It's documented here: https://spark.apache.org/docs/latest/submitting-applications.html -Sandy On Mon, May 18, 2015 at 8:03 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi Sandy, Thanks for your information. Yes, spark-submit --master yarn --num-executors 5 --executor-cores 4

Spark Streaming graceful shutdown in Spark 1.4

2015-05-18 Thread Dibyendu Bhattacharya
Hi, Just figured out that if I want to perform a graceful shutdown of Spark Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no longer works. As in Spark 1.4 there is Utils.addShutdownHook defined for Spark Core, which gets called anyway, which leads to graceful shutdown

Re: spark log field clarification

2015-05-18 Thread Imran Rashid
Depends what you mean by output data. Do you mean: * the data that is sent back to the driver? that is the result size * the shuffle output? that is in Shuffle Write Metrics * the data written to a Hadoop output format? that is in Output Metrics On Thu, May 14, 2015 at 2:22 PM, yanwei

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-18 Thread Imran Rashid
I'm not super familiar with this part of the code, but from taking a quick look: a) the code creates a MultivariateOnlineSummarizer, which stores 7 doubles per feature (mean, max, min, etc. etc.) b) The limit is on the result size from *all* tasks, not from one task. You start with 3072 tasks c)

Re: pass configuration parameters to PySpark job

2015-05-18 Thread Davies Liu
In PySpark, the functions/closures are serialized together with the global values they use. For example: global_param = 111 def my_map(x): return x + global_param rdd.map(my_map) - Davies On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am looking a way to
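A minimal, self-contained sketch of the pattern Davies describes (the parameter value, app name and RDD are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="TAD")   # appName taken from the original question

# Any global (or captured) value referenced by the function is shipped
# to the executors together with the serialized closure.
global_param = 111

def my_map(x):
    return x + global_param

result = sc.parallelize(range(5)).map(my_map).collect()
print(result)   # [111, 112, 113, 114, 115]
```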

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is corrupted

2015-05-18 Thread Imran Rashid
Looks like this exception occurs after many more failures have already happened. It is already on attempt 6 for stage 7 -- I'd try to find out why attempt 0 failed. This particular exception is probably a result of corruption that can happen when stages are retried, which I'm working on addressing in

Re: Spark on Yarn : Map outputs lifetime ?

2015-05-18 Thread Imran Rashid
Neither of those two. Instead, the shuffle data is cleaned up when the stages it came from get GC'ed by the JVM. That is, when you are no longer holding any references to anything which points to the old stages, and there is an appropriate GC event. The data is not cleaned up right after the

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
Hi Suman, For maxIterations, are you using the DenseKMeans.scala example code? (I'm guessing yes since you mention the command line.) If so, then you should be able to specify maxIterations via an extra parameter like --numIterations 50 (note the example uses numIterations in the current master
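If you call MLlib directly instead of going through the example script, the iteration cap can also be passed programmatically. A minimal PySpark sketch under that assumption (the toy points and app name are made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-max-iter")

# Toy 2-dimensional points; replace with your own feature vectors
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])

# maxIterations caps the number of iterations per run
model = KMeans.train(points, k=2, maxIterations=50)
print(model.clusterCenters)
```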

Re: Broadcast variables can be rebroadcast?

2015-05-18 Thread Imran Rashid
Rather than updating the broadcast variable, can't you simply create a new one? When the old one can be gc'ed in your program, it will also get gc'ed from spark's cache (and all executors). I think this will make your code *slightly* more complicated, as you need to add in another layer of
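A small PySpark sketch of the approach Imran suggests (the variable names and values are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rebroadcast-demo")
rdd = sc.parallelize(range(10))

lookup = sc.broadcast({"threshold": 5})
print(rdd.filter(lambda x: x > lookup.value["threshold"]).count())

# Rather than mutating the broadcast, create a fresh one when the data changes;
# once nothing references the old broadcast it can be GC'ed on the driver and
# evicted from the executors, as described above.
lookup = sc.broadcast({"threshold": 7})
print(rdd.filter(lambda x: x > lookup.value["threshold"]).count())
```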

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of the receiver-based Kafka consumer. The last parameter of KafkaUtils.createStream is the per-topic number of Kafka partitions to consume. My question is, does the number of partitions per topic in this parameter need to match the number of partitions in Kafka? For

Re: Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Saisai Shao
Hi Bill, You don't need to match the number of threads to the number of partitions in the specific topic. For example, if you have 3 partitions in topic1 but you only set 2 threads, ideally 1 thread will receive 2 partitions and the other thread the remaining partition; it depends on the scheduling
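For reference, the parameter under discussion is the per-topic thread count passed to the receiver-based API. A hedged PySpark sketch (ZooKeeper host, group id and topic name are made up, and the spark-streaming-kafka package must be available on the classpath):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-receiver-demo")
ssc = StreamingContext(sc, 5)   # 5-second batches

# {topic: number of consumer threads}; this does NOT have to equal the
# number of Kafka partitions, threads share partitions as discussed above.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group",
                                 {"topic1": 2})
stream.count().pprint()

ssc.start()
ssc.awaitTermination()
```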

Re: FetchFailedException and MetadataFetchFailedException

2015-05-18 Thread Imran Rashid
Hi, can you take a look at the logs and see what the first error you are getting is? It's possible that the file doesn't exist when that error is produced, but it shows up later -- I've seen similar things happen, but only after there have already been some errors. But, if you see that in the

Re: MLLib SVMWithSGD is failing for large dataset

2015-05-18 Thread Xiangrui Meng
Reducing the number of instances won't help in this case. We use the driver to collect partial gradients. Even with tree aggregation, it still puts heavy workload on the driver with 20M features. Please try to reduce the number of partitions before training. We are working on a more scalable
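A minimal sketch of the suggested workaround, reducing the partition count before training (the input path is hypothetical and LIBSVM input is assumed just for illustration):

```python
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="svm-fewer-partitions")

# Hypothetical LIBSVM input; the point is the coalesce() before training
data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm")

# Fewer partitions means fewer partial gradients for the driver to aggregate
data = data.coalesce(64).cache()

model = SVMWithSGD.train(data, iterations=100)
```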

Re: StandardScaler failing with OOM errors in PySpark

2015-05-18 Thread Xiangrui Meng
AFAIK, there are two places where you can specify the driver memory. One is via spark-submit --driver-memory and the other is via spark.driver.memory in spark-defaults.conf. Please try these approaches and see whether they work or not. You can find detailed instructions at

Re: bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-05-18 Thread Xiangrui Meng
LogisticRegressionWithSGD doesn't support multi-class. Please use LogisticRegressionWithLBFGS instead. -Xiangrui On Mon, Apr 27, 2015 at 12:37 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: With the Python APIs, the available arguments I got (using inspect module) are the following:
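A hedged PySpark sketch of the suggested switch (note: the numClasses argument is only exposed in newer releases of the Python API, and the tiny 3-class dataset below is made up):

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="multiclass-lbfgs")

# Tiny 3-class toy dataset
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(2.0, [1.0, 1.0]),
])

model = LogisticRegressionWithLBFGS.train(data, iterations=50, numClasses=3)
print(model.predict([1.0, 1.0]))
```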

Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am using spark-sql to read a CSV file and write it as a Parquet file. I am building the schema using the following code. String schemaString = "a b c"; List<StructField> fields = new ArrayList<StructField>(); MetadataBuilder mb = new MetadataBuilder();
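The snippet above is Java and flattened by the archive; for comparison, a hedged PySpark sketch of the same flow (explicit schema, then Parquet output; paths are hypothetical and the 1.3-era saveAsParquetFile call is assumed):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="csv-to-parquet")
sqlContext = SQLContext(sc)

schema = StructType([StructField(name, StringType(), True) for name in ["a", "b", "c"]])

lines = sc.textFile("hdfs:///path/to/input.csv")          # hypothetical path
# Every row must carry exactly as many fields as the schema declares,
# otherwise Parquet writing fails with errors like the one in the subject line.
rows = lines.map(lambda l: l.split(",")).map(lambda p: (p[0], p[1], p[2]))

df = sqlContext.createDataFrame(rows, schema)
df.saveAsParquetFile("hdfs:///path/to/output.parquet")    # df.write.parquet(...) in newer releases
```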

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-18 Thread Steve Loughran
On 16 May 2015, at 04:39, Anton Brazhnyk anton.brazh...@genesys.com wrote: For me it wouldn't help I guess, because those newer classes would still be loaded by a different classloader. What did work for me with 1.3.1 – removing those classes from Spark's jar

NullPointerException when accessing broadcast variable in DStream

2015-05-18 Thread hotienvu
Hi, I'm trying to use broadcast variables in my Spark Streaming program. val conf = new SparkConf().setMaster(SPARK_MASTER).setAppName(APPLICATION_NAME) val ssc = new StreamingContext(conf, Seconds(1)) val LIMIT = ssc.sparkContext.broadcast(5L) println(LIMIT.value) // this prints 5 val
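The (flattened) snippet above is Scala; for illustration only, here is a minimal PySpark sketch of the same pattern, reading a broadcast value inside a DStream transformation (the socket source, port and limit are made up):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="broadcast-in-dstream")
ssc = StreamingContext(sc, 1)    # 1-second batches

LIMIT = sc.broadcast(5)          # reading it on the driver prints 5

lines = ssc.socketTextStream("localhost", 9999)
# The broadcast handle is captured by the closure; .value is read on the executors
filtered = lines.map(lambda s: int(s)).filter(lambda n: n < LIMIT.value)
filtered.pprint()

ssc.start()
ssc.awaitTermination()
```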

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread MEETHU MATHEW
Hi, I think you can't supply an initial set of centroids to kmeans. Thanks Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar suman.somasun...@oracle.com wrote:

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
You can use spark.streaming.receiver.maxRate (default: not set): the maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no
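A small sketch of where such a setting usually goes, in PySpark (the app name, rate and batch interval are illustrative):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("rate-limited-stream")
        # Each receiver will consume at most 1000 records per second;
        # leave it unset (the default) for no limit.
        .set("spark.streaming.receiver.maxRate", "1000"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)   # 2-second batches
```

The same property can also be passed on the command line via spark-submit --conf.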

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
Hi, Give the dataFrame.fillna function a try to fill up the missing column. Best Ayan On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, I am using spark-sql to read a CSV file and write it as parquet file. I am building the schema
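A minimal PySpark sketch of the suggestion (the toy data is made up; fillna is the Python DataFrame method, while in the Java/Scala API the equivalent lives under df.na().fill(...), which may be why it is not visible as a direct DataFrame method there):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="fillna-demo")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("x", "1"), ("y", None)], ["a", "b"])

# Replace missing values before writing Parquet
filled = df.fillna("0")
filled.show()
```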

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
And if you want to genuinely “reduce the latency” (still within the boundaries of the micro-batch) THEN you need to design and finely tune the Parallel Programming / Execution Model of your application. The objective/metric here is: a) Consume all data within your selected micro-batch