Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-08-04 Thread Jeff Zhang
Please check the node manager logs to see why the container is killed. On Mon, Aug 3, 2015 at 11:59 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi all, any help will be much appreciated. My Spark job runs fine, but in the middle it starts losing executors because of a MetadataFetchFailed exception
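
For reference: a common cause of this symptom (an assumption here; the thread itself only points at the node manager logs) is YARN killing containers that exceed their memory allocation. A minimal sketch of raising the off-heap headroom, using Spark 1.4-era configuration keys and illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("yarn-client-job")        // illustrative app name
      .set("spark.executor.memory", "4g")   // illustrative heap size
      // Extra off-heap headroom per executor container, in MB. YARN kills
      // containers exceeding heap + overhead, which shows up as lost executors.
      .set("spark.yarn.executor.memoryOverhead", "1024")
    val sc = new SparkContext(conf)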

Fwd: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Priya Ch
Yes... union would be one solution. I am not doing any aggregation, hence reduceByKey would not be useful. If I use groupByKey, messages with the same key would be obtained in a partition. But groupByKey is a very expensive operation as it involves a shuffle. My ultimate goal is to write the

Re: TFIDF Transformation

2015-08-04 Thread Yanbo Liang
It cannot translate the number back to the word unless you store the mapping in a map yourself. 2015-07-31 1:45 GMT+08:00 hans ziqiu li thenewh...@gmail.com: Hello spark users! I am having some trouble with TF-IDF in MLlib and was wondering if anyone can point me in the right direction. The
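
Since the hashing trick is one-way, the usual workaround is to build the index-to-term map yourself. A minimal sketch, assuming the 1.4-era MLlib API and an illustrative vocabulary:

    import org.apache.spark.mllib.feature.HashingTF

    val hashingTF = new HashingTF()
    val vocabulary = Seq("spark", "mllib", "tfidf") // illustrative terms
    // Map each hashed index back to the terms that produced it. Note that
    // hashing can collide: several terms may share one index.
    val indexToTerms: Map[Int, Seq[String]] =
      vocabulary.groupBy(term => hashingTF.indexOf(term))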

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Igor Berman
Sorry, can't disclose info about my prod cluster. Nothing jumps into my mind regarding your config. We don't use lz4 compression; I don't know what spark.deploy.spreadOut is (there is no documentation regarding it). If you are sure that you don't have a memory leak in your business logic, I would try

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Yanbo Liang
It looks like the predicted result is just the opposite of what is expected, so could you check whether the label is right? Or could you share some data that would help reproduce this output? 2015-08-03 19:36 GMT+08:00 Barak Gitsis bar...@similarweb.com: hi, I've run into some poor RF behavior,

Total delay per batch in a CSV file

2015-08-04 Thread allonsy
Hi everyone, I'm working with Spark Streaming, and I need to perform some offline performance measures. What I'd like to have is a CSV file that reports something like this: *Batch number/timestamp | Input Size | Total Delay*, which is in fact similar to what the UI outputs. I tried
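
One way to collect these numbers outside the UI is a StreamingListener that appends a CSV row per finished batch. A sketch, assuming the 1.4-era listener API and a hypothetical output path:

    import java.io.{FileWriter, PrintWriter}
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Writes one row per batch: batch timestamp, input size, total delay (ms).
    class CsvBatchListener(path: String) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val out = new PrintWriter(new FileWriter(path, true)) // append mode
        try out.println(s"${info.batchTime.milliseconds},${info.numRecords},${info.totalDelay.getOrElse(-1L)}")
        finally out.close()
      }
    }

    // Register before starting the context:
    // ssc.addStreamingListener(new CsvBatchListener("batches.csv"))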

Re: Delete NA in a dataframe

2015-08-04 Thread Peter Rudenko
Hi Clark, the problem is that in this dataset null values are represented by the NA marker. Spark-csv doesn't have a configurable null-value marker (I made a PR for it some time ago: https://github.com/databricks/spark-csv/pull/76). So one option for you is to do post filtering, something like
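
The post filtering he mentions could look like the following sketch, assuming the 1.4-era DataFrame API and illustrative column names:

    // df was loaded via com.databricks.spark.csv; "col1" and "col2" are
    // illustrative. Drop rows still carrying the literal "NA" marker.
    val cleaned = df.filter(df("col1") !== "NA")
                    .filter(df("col2") !== "NA")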

Delete NA in a dataframe

2015-08-04 Thread clark djilo kuissu
Hello, I am trying to manage NA values in this dataset. I import my dataset with the com.databricks.spark.csv package. When I do this: allyears2k.na.drop() I have no result. Can you help me please? Regards, --- The dataset - dataset:

Re: Schedule lunchtime today for a free webinar IoT data ingestion in Spark Streaming using Kaa 11 a.m. PDT (2 p.m. EDT)

2015-08-04 Thread orozvadovskyy
Hi there! If you missed our webinar on IoT data ingestion in Spark with KaaIoT, see the video and slides here: http://goo.gl/VMyQ1M We recorded our webinar on “IoT data ingestion in Spark Streaming using Kaa” for those who couldn’t see it live or who would like to refresh what they have

AW: Twitter live Streaming

2015-08-04 Thread Filli Alem
Hi Sadaf, I'm currently struggling with Twitter streaming as well. I can't get it working using the simple setup below. I use Spark 1.2 and I replaced twitter4j v3 with v4. Am I doing something wrong? How are you doing this? twitter4j.conf.Configuration conf = new

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Hi Sean, Thanks for the information and clarification. On Tue, Aug 4, 2015 at 1:04 PM, Sean Owen so...@cloudera.com wrote: It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented as native libraries under the hood.

Re: Twitter live Streaming

2015-08-04 Thread Enno Shioji
If you want to do it through the streaming API you have to pay Gnip; it's not free. You can go through the non-streaming Twitter API and convert it to a stream yourself though. On 4 Aug 2015, at 09:29, Sadaf sa...@platalytics.com wrote: Hi Is there any way to get all old tweets since when the

Re: How do I Process Streams that span multiple lines?

2015-08-04 Thread Akhil Das
If you are using Kafka, then you can basically push an entire file as a message to Kafka. In that case in your DStream you will receive a single message containing the contents of the file, and it can of course span multiple lines. Thanks Best Regards On Mon, Aug 3, 2015 at 8:27 PM, Spark

giving offset in spark sql

2015-08-04 Thread Hafiz Mujadid
Hi all! I want to skip the first n rows of a DataFrame. In standard SQL this is done using the OFFSET keyword. How can we achieve this in Spark SQL? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/giving-offset-in-spark-sql-tp24130.html Sent from the Apache
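
Spark SQL has no OFFSET at this point; one workaround (a sketch, not from the thread, assuming a SQLContext named sqlContext, and only meaningful if the DataFrame has a defined ordering) is to index the rows and filter:

    val n = 5L // number of rows to skip; illustrative
    val remaining = df.rdd
      .zipWithIndex()
      .filter { case (_, idx) => idx >= n } // drop the first n rows
      .map { case (row, _) => row }
    val skipped = sqlContext.createDataFrame(remaining, df.schema)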

Spark SQL support for Hive 0.14

2015-08-04 Thread Ishwardeep Singh
Hi, Does Spark SQL support Hive 0.14? The documentation refers to Hive 0.13. Is there a way to compile Spark with Hive 0.14? Currently we are using Spark 1.3.1. Thanks -- View this message in context:

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Can you elaborate on what this native library covers? One thing you mentioned is accelerated compression. It would be very helpful if you could give a useful link to read more about it. On Tue, Aug 4, 2015 at 12:56 PM, Sean Owen so...@cloudera.com wrote: You can ignore it entirely. It

NoSuchMethodError : org.apache.spark.streaming.scheduler.StreamingListenerBus.start()V

2015-08-04 Thread Deepesh Maheshwari
Hi, I am trying to read data from Kafka and process it using Spark. I have attached my source code and error log. For integrating Kafka, I have added this dependency in pom.xml: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.10</artifactId>

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
You can ignore it entirely. It just means you haven't installed and configured native libraries for things like accelerated compression, but it has no negative impact otherwise. On Tue, Aug 4, 2015 at 8:11 AM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, When i run the spark

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented as native libraries under the hood. Spark doesn't necessarily use these anyway; it's from the Hadoop libs. On Tue, Aug 4, 2015 at 8:30 AM, Deepesh Maheshwari

Re: Running multiple batch jobs in parallel using Spark on Mesos

2015-08-04 Thread Akhil Das
One approach would be to use a Jobserver in between and create SparkContexts in it. Let's say you create two: one configured to run coarse-grained and another set to fine-grained. Let the high-priority jobs hit the coarse-grained SparkContext and the other jobs use the fine-grained one.
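
The split described above comes down to two differently configured contexts. A sketch of the relevant flag, assuming each context lives in its own jobserver-managed JVM and an illustrative Mesos master URL:

    import org.apache.spark.SparkConf

    // High-priority context: coarse-grained mode keeps its cores for the
    // lifetime of the application.
    val coarseConf = new SparkConf()
      .setMaster("mesos://master:5050")
      .set("spark.mesos.coarse", "true")

    // Other jobs: fine-grained mode returns cores between tasks.
    val fineConf = new SparkConf()
      .setMaster("mesos://master:5050")
      .set("spark.mesos.coarse", "false")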

Re: large scheduler delay in pyspark

2015-08-04 Thread Davies Liu
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote: Hi, Recently, I met some problems with scheduler delay in pyspark. I worked several days on this problem, but without success. Therefore, I come here to ask for help. I have a key-value pair RDD like rdd[(key, list[dict])]

Re: Schema evolution in tables

2015-08-04 Thread Brandon White
Sim did you find anything? :) On Sun, Jul 26, 2015 at 9:31 AM, sim s...@swoop.com wrote: The schema merging http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging section of the Spark SQL documentation shows an example of schema evolution in a partitioned table. Is

Unable to load native-hadoop library for your platform

2015-08-04 Thread Deepesh Maheshwari
Hi, When I run Spark locally on Windows it gives the Hadoop library error below. I am using the following Spark version: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.4.1</version> </dependency> 2015-08-04 12:22:23,463

Twitter live Streaming

2015-08-04 Thread Sadaf
Hi, Is there any way to get all old tweets, from when the account was created, using Spark Streaming and Twitter's API? Currently my connector is showing those tweets that get posted after the program runs. I've done this task using Spark Streaming and a custom receiver using the Twitter user API.

Re: Writing to HDFS

2015-08-04 Thread Akhil Das
Just to add, rdd.take(1) won't trigger the entire computation; it will just pull out the first record. You need to do rdd.count() or rdd.saveAs*Files to trigger the complete pipeline. How many partitions do you see in the last stage? Thanks Best Regards On Tue, Aug 4, 2015 at 7:10 AM, ayan guha

Transform MongoDB Aggregation into Spark Job

2015-08-04 Thread Deepesh Maheshwari
Hi, I am new to Apache Spark and exploring Spark + Kafka integration to process data using Spark, which I did earlier with MongoDB aggregation. I am not able to figure out how to handle my use case. Mongo document: { _id : ObjectId(55bfb3285e90ecbfe37b25c3), url :

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Sea
How many machines are there in your standalone cluster? I am not using Tachyon. GC cannot help me... Can anyone help? My configuration: spark.deploy.spreadOut false spark.eventLog.enabled true spark.executor.cores 24 spark.ui.retainedJobs 10 spark.ui.retainedStages 10

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Gerard Maas
(removing dev from the to: as it's not relevant) It would be good to see some sample data and the Cassandra schema to have a more concrete idea of the problem space. Some thoughts: reduceByKey could still be used to 'pick' one element. Example of arbitrarily choosing the first one: reduceByKey{case
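
Completing the truncated snippet is guesswork, so the following is only an illustration of the 'pick one' pattern on a keyed pair RDD:

    // keyed: RDD[(K, Msg)], keyed by whatever identifies a logical message.
    // Arbitrarily keep the first of any two values sharing a key, so
    // duplicates collapse before the Cassandra write.
    val deduped = keyed.reduceByKey { case (first, _) => first }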

Re: TFIDF Transformation

2015-08-04 Thread clark djilo kuissu
Hi, I had the same problem and I didn't find a solution. I used Word2Vec instead. I am interested in a solution to this problem of how to go back from the TF-IDF hashing to words. Regards, Clark On Tuesday, 4 August 2015 at 13:03, Yanbo Liang yblia...@gmail.com wrote: It can not

Re: Setting a stage timeout

2015-08-04 Thread William Kinney
Yes I upgraded but I would still like to set an overall stage timeout. Does that exist? On Fri, Jul 31, 2015 at 1:13 PM, Ted Yu yuzhih...@gmail.com wrote: The referenced bug has been fixed in 1.4.0, are you able to upgrade ? Cheers On Fri, Jul 31, 2015 at 10:01 AM, William Kinney

Re: Safe to write to parquet at the same time?

2015-08-04 Thread Cheng Lian
It should be safe for Spark 1.4.1 and later versions. Now Spark SQL adds a job-wise UUID to output file names to distinguish files written by different write jobs. So those two write jobs you gave should play well with each other. And the job committed later will generate a summary file for

Re: Parquet SaveMode.Append Trouble.

2015-08-04 Thread Cheng Lian
You need to import org.apache.spark.sql.SaveMode Cheng On 7/31/15 6:26 AM, satyajit vegesna wrote: Hi, I am new to using Spark and Parquet files. Below is what I am trying to do, on spark-shell: val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") Have
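
With the import in place, an append might look like this sketch, assuming Spark 1.4's DataFrameWriter and an illustrative path:

    import org.apache.spark.sql.SaveMode

    // Append to an existing Parquet dataset instead of failing because the
    // target directory already exists.
    df.write.mode(SaveMode.Append).parquet("/data/LM/Parquet/Segment/pages")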

Re: spark streaming max receiver rate doubts

2015-08-04 Thread Cody Koeninger
Those jobs will still be created for each valid time; they just may not have many messages in them. On Mon, Aug 3, 2015 at 11:11 PM, Shushant Arora shushantaror...@gmail.com wrote: 1.In spark 1.3(Non receiver) - If my batch interval is 1 sec and I don't set

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Ted Yu
w.r.t. spark.deploy.spreadOut, here is the scaladoc: // As a temporary workaround before better ways of configuring memory, we allow users to set // a flag that will perform round-robin scheduling across the nodes (spreading out each app // among all the nodes) instead of trying to

Re: Twitter Connector-Spark Streaming

2015-08-04 Thread Akhil Das
You will have to write your own consumer to pull your custom feeds, and then you can do a union (customfeedDstream.union(twitterStream)) with the Twitter stream API. Thanks Best Regards On Tue, Aug 4, 2015 at 2:28 PM, Sadaf Khan sa...@platalytics.com wrote: Thanks a lot :) One more thing
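
In outline, the union could look like this sketch, with a socket source standing in for the custom-feed receiver (the real receiver, host, and port are assumptions):

    import org.apache.spark.streaming.twitter.TwitterUtils

    // Tweets via the public streaming API (twitter4j credentials assumed
    // to be configured through system properties).
    val twitterStream = TwitterUtils.createStream(ssc, None).map(_.getText)
    // Stand-in for the custom feed consumer: any DStream[String] works.
    val customFeedStream = ssc.socketTextStream("localhost", 9999)
    // Process both sources as a single DStream.
    val combined = customFeedStream.union(twitterStream)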

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Barak Gitsis
Maybe try reducing spark.executor.cores. Perhaps your tasks have a large off-heap overhead, and it is better to have fewer tasks running in parallel. Is it a streaming job? On Tue, Aug 4, 2015 at 2:14 PM Igor Berman igor.ber...@gmail.com wrote: sorry, can't disclose info about my prod cluster nothing jumps

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
Yes it does, in fact it's probably going to be one of the more expensive shuffles you could trigger. On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu rotationsymmetr...@gmail.com wrote: Does RDD.cartesian involve shuffling? Thanks!

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-04 Thread Upen N
Hi Guru, It was a no-brainer issue. I had to create the HDFS user ec2-user to make it work. It worked like a charm after that. Thanks, Upender On Mon, Aug 3, 2015 at 10:27 PM, Guru Medasani gdm...@gmail.com wrote: Hi Upen, Did you deploy the client configs after assigning the gateway roles? You

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One option is to use the coalesce method in the RDD class. Mohammed From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Tuesday, August 4, 2015 7:23 PM To: user Subject: Combining Spark Files with saveAsTextFile What is the best way to make saveAsTextFile save as only a single file?

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then call saveAsTextFile. For example, rdd.coalesce(1).saveAsTextFile(...) Mohammed From: Mohammed Guller Sent: Tuesday, August 4, 2015 9:39 PM To: 'Brandon White'; user Subject: RE: Combining Spark Files with

control the number of reducers for groupby in data frame

2015-08-04 Thread Fang, Mike
Hi, Does anyone know how I could control the number of reducers when we do operations such as groupBy on a DataFrame? I can set spark.sql.shuffle.partitions in SQL, but I am not sure how to do it with the df.groupBy(XX) API. Thanks, Mike
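
The same property drives DataFrame aggregations; a sketch, assuming a 1.x SQLContext named sqlContext and an illustrative column:

    // spark.sql.shuffle.partitions controls post-shuffle parallelism for
    // DataFrame operations just as it does for SQL queries.
    sqlContext.setConf("spark.sql.shuffle.partitions", "16") // illustrative value
    val counts = df.groupBy("someColumn").count() // now runs with 16 reducers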

Turn Off Compression for Textfiles

2015-08-04 Thread Brandon White
How do you turn off gz compression for saving as text files? Right now, I am reading .gz files and it is saving them as .gz. I would love to not compress them when I save. 1) DStream.saveAsTextFiles() //no compression 2) RDD.saveAsTextFile() //no compression Any ideas?

Re: Turn Off Compression for Textfiles

2015-08-04 Thread Philip Weaver
The .gz extension indicates that the file is compressed with gzip. Choose a different extension (e.g. .txt) when you save them. On Tue, Aug 4, 2015 at 7:00 PM, Brandon White bwwintheho...@gmail.com wrote: How do you turn off gz compression for saving as textfiles? Right now, I am reading .gz
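
Concretely, the no-codec overloads write plain text (a sketch; paths are illustrative, and this holds unless the Hadoop job configuration forces output compression):

    // Uncompressed text output for an RDD.
    rdd.saveAsTextFile("/out/rdd-plain")
    // For DStreams the second argument is the file suffix; using "txt"
    // makes it explicit that the part files are plain text.
    dstream.saveAsTextFiles("/out/stream-plain", "txt")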

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-08-04 Thread Yanbo Liang
The old mllib API will use RandomForest.trainClassifier() to train a RandomForestModel; the new mllib API (AKA ML) will use RandomForestClassifier.train() to train a RandomForestClassificationModel. They will produce the same result for a given dataset. 2015-07-31 1:34 GMT+08:00 Bryan Cutler

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
And also https://issues.apache.org/jira/browse/SPARK-3106 This one is still open. On Tue, Aug 4, 2015 at 6:12 PM, Jim Green openkbi...@gmail.com wrote: *Symptom:* Even the sample job fails: $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 Pi is roughly 3.140636 ERROR

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the Spark cluster (1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral HDFS. No matter what I tried, there is no data locality at

Re: Spark SQL unable to recognize schema name

2015-08-04 Thread Ted Yu
This should have been fixed by: [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support The fix is in the upcoming 1.5.0 FYI On Tue, Aug 4, 2015 at 11:45 AM, Mohammed Guller moham...@glassbeam.com wrote: Hi - I am running the Thrift JDBC/ODBC

Re: Problem submiting an script .py against an standalone cluster.

2015-08-04 Thread Ford Farline
The code is very simple, just a couple of lines. When I launch it, it runs in local mode but not on the cluster. sc = SparkContext("local", "Tech Companies Feedback") beginning_time = datetime.now() time.sleep(60) print datetime.now() - beginning_time sc.stop() Thanks for your interest, Gonzalo On Fri,

Re: How does the # of tasks affect # of threads?

2015-08-04 Thread Connor Zanin
Elkhan, Thank you for the response. This was a great answer. On Tue, Aug 4, 2015 at 1:47 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi Connor, Spark creates cached thread pool in Executor

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Steve Loughran
Think it may be needed on Windows, certainly if you start trying to work with local files. On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote: It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented as

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Michael Armbrust
I'll add that while Spark SQL 1.5 compiles against Hive 1.2.1, it has support for reading from metastores for Hive 0.12 - 1.2.1 On Tue, Aug 4, 2015 at 9:59 AM, Steve Loughran ste...@hortonworks.com wrote: Spark 1.3.1 1.4 only support Hive 0.13 Spark 1.5 is going to be released against Hive

Spark SQL unable to recognize schema name

2015-08-04 Thread Mohammed Guller
Hi - I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem when querying tables using fully qualified table names (schemaName.tableName). The following query works fine from the beeline tool: SELECT * from test; However, the following query throws an exception, even

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Meihua Wu
Thanks, Richard! I basically have two RDDs, A and B, and I need to compute a value for every pair (a, b) with a in A and b in B. My first thought was cartesian, but it involves an expensive shuffle. Any alternatives? How about I convert B to an array and broadcast it to every node (assuming B is

No Twitter Input from Kafka to Spark Streaming

2015-08-04 Thread narendra
My application takes Twitter4j tweets and publishes them to a topic in Kafka. Spark Streaming subscribes to that topic for processing. But in practice, Spark Streaming is not able to receive tweet data from Kafka, so Spark Streaming is running empty batch jobs without input and I am not able to see

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Patrick Lam
Yes, I rechecked and the label is correct. As you can see in the code posted, I used the exact same features for all the classifiers so unless rf somehow switches the labels, it should be correct. I have posted a sample dataset and sample code to reproduce what I'm getting here:

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-04 Thread Cody Koeninger
Have you tried using the console consumer to see if anything is actually getting published to that topic? On Tue, Aug 4, 2015 at 11:45 AM, narendra narencs...@gmail.com wrote: My application takes Twitter4j tweets and publishes those to a topic in Kafka. Spark Streaming subscribes to that

Re: Transform MongoDB Aggregation into Spark Job

2015-08-04 Thread Jörn Franke
Hi, I think the combination of MongoDB and Spark is a little bit unlucky. Why don't you simply use MongoDB? If you want to process a lot of data you should use HDFS or Cassandra as storage. MongoDB is not suitable for heterogeneous processing of large-scale data. Best regards,

Re: Repartition question

2015-08-04 Thread Richard Marscher
Hi, it is possible to control the number of partitions for the RDD without calling repartition by setting the max split size for the Hadoop input format used. Tracing through the code, XmlInputFormat extends FileInputFormat, which determines the number of splits (which NewHadoopRDD uses to
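
A sketch of that knob, assuming a FileInputFormat-based read and the standard Hadoop 2 property name:

    // Cap split size at 32 MB so each file yields more, smaller partitions.
    // Applies to reads done via sc.newAPIHadoopFile (e.g. with XmlInputFormat).
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (32 * 1024 * 1024).toString)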

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the two messages read from Kafka must differ on some parameter at least, or else they are logically the same. As a solution to your problem, if the message content is the same, you could create a new field UUID, which might play the role of partition key while inserting the two messages in Cassandra. Msg1

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
That is the only alternative I'm aware of. If either A or B is small enough to broadcast, then you'd at least be doing the cartesian products all locally without needing to also transmit and shuffle A. Unless Spark somehow optimizes cartesian product and only transfers the smaller RDD across the
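
The broadcast variant in outline (a sketch; it assumes the B from the thread genuinely fits in memory on every node, with rddA, rddB, and sc in scope):

    // Avoid the cartesian shuffle: ship the small side to every node once,
    // then pair up locally.
    val bLocal = rddB.collect()
    val bBroadcast = sc.broadcast(bLocal)
    val pairs = rddA.flatMap { a =>
      bBroadcast.value.map(b => (a, b)) // compute the per-pair value here
    }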

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Steve Loughran
Spark 1.3.1 and 1.4 only support Hive 0.13. Spark 1.5 is going to be released against Hive 1.2.1; it'll skip Hive 0.14 support entirely and go straight to the currently supported Hive release. See SPARK-8064 for the gory details. On 3 Aug 2015, at 23:01, Ishwardeep Singh

Re: How does the # of tasks affect # of threads?

2015-08-04 Thread Elkhan Dadashov
Hi Connor, Spark creates a cached thread pool in Executor https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala for executing the tasks: // Start worker thread pool private val threadPool = ThreadUtils.newDaemonCachedThreadPool(Executor task

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
Oh good point, does the Windows integration need native libs for POSIX-y file system access? I know there are some binaries shipped for this purpose but wasn't sure if that's part of what's covered in the native libs message. On Tue, Aug 4, 2015 at 6:01 PM, Steve Loughran ste...@hortonworks.com

Re: scheduler delay time

2015-08-04 Thread maxdml
You'd need to provide information such as executor configuration (#cores, memory size). You might have less scheduler delay with smaller but more numerous executors than with the opposite. -- View this message in context:

Re: How to increase parallelism of a Spark cluster?

2015-08-04 Thread Richard Marscher
I think you did a good job of summarizing terminology and describing Spark's operation. However, #7 is inaccurate if I am interpreting it correctly. The scheduler schedules X tasks from the current stage across all executors, where X is the number of cores assigned to the application (assuming

Combining Spark Files with saveAsTextFile

2015-08-04 Thread Brandon White
What is the best way to make saveAsTextFile save as only a single file?

Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
*Symptom:* Even the sample job fails: $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 Pi is roughly 3.140636 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(xxx,) not found WARN ConnectionManager: All connections not cleaned up Found