Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi, Nope. I was trying to use EC2 (due to a few constraints) where I faced the problem. With EMR, it works flawlessly. But I would like to go back and use EC2 if I can fix this issue. Has anybody set up a Spark cluster using plain EC2 machines? What steps did you follow? Thanks and Regards,

Re: Networking issues with Spark on EC2

2015-09-25 Thread Natu Lauchande
Hi, Are you using EMR ? Natu On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH wrote: > Hi Ankur, > Thanks for the reply. > This is already done. > If I wait for a long amount of time(10 minutes), a few tasks get > successful even on slave nodes. Sometime, a fraction of the

Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi Ankur, Thanks for the reply. This is already done. If I wait for a long amount of time (10 minutes), a few tasks get successful even on slave nodes. Sometimes, a fraction of the tasks (20%) is completed on all the machines in the initial 5 seconds and then it slows down drastically. Thanks and

Generic DataType in UDAF

2015-09-25 Thread Ritesh Agrawal
Hi all, I am trying to learn about UDAFs and implemented a simple reservoir sample UDAF. It's working fine. However, I am not able to figure out what DataType I should use so that it can deal with all DataTypes (simple and complex). For instance, currently I have defined my input schema as def
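
For context, a minimal sketch of what such a UDAF looks like against the Spark 1.5 UserDefinedAggregateFunction API, assuming a single StringType input column; the concrete type has to be spelled out in inputSchema, which is exactly the limitation being described. Class and field names are illustrative, and the body just keeps the first n values rather than doing true reservoir sampling:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Illustrative UDAF that keeps up to n sampled values per group.
    // inputSchema is fixed to StringType; other input types need their own schema.
    class SampleN(n: Int) extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("samples", ArrayType(StringType)) :: Nil)
      def dataType: DataType = ArrayType(StringType)
      def deterministic: Boolean = false
      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = Seq.empty[String]
      }
      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val current = buffer.getSeq[String](0)
        if (current.size < n) buffer(0) = current :+ input.getString(0)
      }
      def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        b1(0) = (b1.getSeq[String](0) ++ b2.getSeq[String](0)).take(n)
      }
      def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
    }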

Re: --class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
Either setting it programmatically doesn't work: sparkConf.setIfMissing("class", "...Main") In my current setup, moving main to another package requires propagating the change to deploy scripts. It doesn't matter, I will find some other way. Petr On Fri, Sep 25, 2015 at 4:40 PM, Petr Novak

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
I am seeing a similar issue when reading from Kafka. I have a single Kafka broker with 1 topic and 10 partitions on a separate machine. I have a three-node spark cluster, and verified that all workers are registered with the master. I'm initializing Kafka using a similar method to this article:

Re: Java Heap Space Error

2015-09-25 Thread Yusuf Can Gürkan
Hello, It worked like a charm. Thank you very much. Some userids were null, which is why many records went to userid 'null'. When I put a where clause userid != 'null', it solved the problem. > On 24 Sep 2015, at 22:43, java8964 wrote: > > I can understand why your first query

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
Good catch, I was not aware of this setting. I’m wondering though if it also generates a shuffle or if the data is still processed by the node on which it’s ingested - so that you’re not gated by the number of cores on one machine. -adrian On 9/25/15, 5:27 PM, "Silvio Fiorito"

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
Storing passbacks transactionally with results in your own data store, with a schema that makes sense for you, is the optimal solution. On Fri, Sep 25, 2015 at 11:05 AM, Radu Brumariu wrote: > Right, I understand why the exceptions happen. > However, it seems less useful to
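
For reference, a rough sketch of that pattern (assuming a DStream obtained from KafkaUtils.createDirectStream; the transactional sink calls in the comments are placeholders, not a real API):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    def processAndCommit[T](stream: DStream[T]): Unit =
      stream.foreachRDD { rdd =>
        // One OffsetRange per Kafka partition; its index lines up with the RDD partition id
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { records =>
          val osr = offsetRanges(TaskContext.get.partitionId)
          // Placeholder transactional sink: write this partition's results and osr.untilOffset
          // in the same transaction, so a replay after failure never double-counts:
          // begin(); writeResults(records); saveOffset(osr.topic, osr.partition, osr.untilOffset); commit()
        }
      }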

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Radu Brumariu
Right, I understand why the exceptions happen. However, it seems less useful to have checkpointing that only works in the case of an application restart. IMO, code changes happen quite often, and not being able to pick up where the previous job left off is quite a hindrance. The

--class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
Otherwise it seems it tries to load from a checkpoint which I have deleted and which cannot be found. Or it should work and I have something else wrong. The documentation doesn't mention the option with the jar manifest, so I assume it doesn't work this way. Many thanks, Petr

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-25 Thread Tim Chen
Hi Utkarsh, What is your job placement like when you run fine grain mode? You said coarse grain mode only ran with one node right? And when the job is running could you open the Spark webui and get stats about the heap size and other java settings? Tim On Thu, Sep 24, 2015 at 10:56 PM, Utkarsh

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Zhiliang Zhu
It seems that is due to the Spark SPARK_LOCAL_IP setting. export SPARK_LOCAL_IP=localhost will not work. Then, how should it be set? Thank you all~~ On Friday, September 25, 2015 5:57 PM, Zhiliang Zhu wrote: Hi Steve, Thanks a lot for your reply. That

Troubles interacting with different version of Hive metastore

2015-09-25 Thread Ferran Galí
Hello, I'm trying to start the SparkSQL thriftserver over YARN, connecting it to the 1.1.0-cdh5.4.3 hive metastore that we already have in production. I downloaded the latest version of Spark (1.5.0), I just followed the instructions from the documentation

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
Hi, We have an application that submits several thousands jobs within the same SparkContext, using a thread pool to run about 50 in parallel. We're running on YARN using Spark 1.4.1 and seeing a problem where our driver is killed by YARN due to running beyond physical memory limits (no Java OOM

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
I tried but I'm getting the same error (task not serializable) > On 25 Sep 2015, at 20:10, Ted Yu wrote: > > Is the Schema.parse() call expensive ? > > Can you call it in the closure ? > >> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv >>

Re: Weird worker usage

2015-09-25 Thread N B
Hi Akhil, I do have 25 partitions being created. I have set the spark.default.parallelism property to 25. Batch size is 30 seconds and block interval is 1200 ms which also gives us roughly 25 partitions from the input stream. I can see 25 partitions being created and used in the Spark UI also.

Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
Seems like you have "hive.server2.enable.doAs" enabled; you can either disable it, or configure hs2 so that the user running the service ("hadoop" in your case) can impersonate others. See: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html On Fri, Sep 25,

Re: Distance metrics in KMeans

2015-09-25 Thread sethah
It looks like the distance metric is hard coded to the L2 norm (euclidean distance) in MLlib. As you may expect, you are not the first person to desire other metrics and there has been some prior effort. Please reference this PR: https://github.com/apache/spark/pull/2634 And corresponding JIRA:

Distance metrics in KMeans

2015-09-25 Thread bobtreacy
Is it possible to use other distance metrics than Euclidean (e.g. Tanimoto, Manhattan) with MLlib KMeans? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-metrics-in-KMeans-tp24823.html Sent from the Apache Spark User List mailing list archive at

how to control timeout in node failure for spark task ?

2015-09-25 Thread roy
Hi, We are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how to control the task timeout when a node fails, so that tasks running on it are restarted on another node. At present the job waits approximately 10 minutes to restart the tasks that were running on the failed node.

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Yes, the partition IDs are the same. As far as the failure / subclassing goes, you may want to keep an eye on https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the suggestions in there will end up going anywhere. On Fri, Sep 25, 2015 at 3:01 PM, Neelesh wrote:

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks. I'll keep an eye on this. Our implementation of the DStream basically accepts a function to compute current offsets. The implementation of the function fetches the list of topics from ZooKeeper once in a while. It then adds consumer offsets for newly added topics with the currentOffsets that's in

Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread java8964
Hi, Spark Users: I have a problem where Spark cannot recognize the string type in the Parquet schema generated by Hive. Versions of all components: Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2. I generated a detail low-level table in the Parquet format using MapReduce Java code. This table can be read

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug of parquet-hive. When generating a Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message
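
For anyone hitting this, a minimal sketch of setting the option before reading (the path and the sqlContext from a spark-shell are assumed):

    // Treat Parquet BINARY columns without a UTF8 annotation as strings (legacy Hive files)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    // Spark 1.3.x reader; on 1.4+ the equivalent is sqlContext.read.parquet(...)
    val df = sqlContext.parquetFile("/user/hive/warehouse/detail_table")
    df.printSchema()  // string columns should now show up as string rather than binary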

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the the SQL option

Handle null/NaN values in mllib classifier

2015-09-25 Thread matd
Hi folks, I have a set of categorical columns (strings), that I'm parsing and converting into Vectors of features to pass to a mllib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values + a null value : How should I build my

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
Looking at this further, it appears that my Spark Context is not correctly setting the Master name. I see the following in logs: 15/09/25 16:45:42 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp"

hive on spark query error

2015-09-25 Thread Garry Chen
Hi All, I am following https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started? to set up Hive on Spark. After setup/configuration everything starts up and I am able to show tables, but when executing a SQL statement within beeline I get an error. Please help and

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks Petr, Cody. This is a reasonable place to start for me. What I'm trying to achieve: stream.foreachRDD { rdd => rdd.foreachPartition { p => Try(myFunc(...)) match { case Success(s) => updatewatermark for this partition // of course, expectation is that it will work only if

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Radu Brumariu
Wouldn't the same case be made for checkpointing in general ? What I am trying to say, is that this particular situation is part of the general checkpointing use case, not an edge case. I would like to understand why shouldn't the checkpointing mechanism, already existent in Spark, handle this

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
For the 1-1 mapping case, can I use TaskContext.get().partitionId as an index in to the offset ranges? For the failure case, yes, I'm subclassing of DirectKafkaInputDStream. As for failures, different partitions in the same batch may be talking to different RDBMS servers due to multitenancy - a

Re: Reading Hive Tables using SQLContext

2015-09-25 Thread Michael Armbrust
Eventually I'd like to eliminate HiveContext, but for now I just recommend that most users use it instead of SQLContext. On Thu, Sep 24, 2015 at 5:41 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Thanks Michael. Just want to check if there is a roadmap to include Hive >

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Gavin Yue
I think I found the problem. Have to change the yarn capacity scheduler to use DominantResourceCalculator Thanks! On Fri, Sep 25, 2015 at 4:54 AM, Akhil Das wrote: > Which version of spark are you having? Can you also check whats set in > your

Re: Convert Vector to RDD[Double]

2015-09-25 Thread Sourigna Phetsarath
import org.apache.spark.mllib.linalg._ val v = Vectors.dense(1.0,2.0) val rdd = sc.parallelize(v.toArray) On Fri, Sep 25, 2015 at 2:46 PM, Yusuf Can Gürkan wrote: > How can i convert a Vector to RDD[Double]. For example: > > val vector = Vectors.dense(1.0,2.0) > val

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Your success case will work fine, it is a 1-1 mapping as you said. To handle failures in exactly the way you describe, you'd need to subclass or modify DirectKafkaInputDStream and change the way compute() works. Unless you really are going to have very fine-grained failures (why would only a

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Neelesh
As Cody says, to achieve true exactly once, the book keeping has to happen in the sink data system, that too assuming its a transactional store. Wherever possible, we try to make the application idempotent (upsert in HBase, ignore-on-duplicate for MySQL etc), but there are still cases (analytics,
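
As a tiny illustration of the "upsert / ignore-on-duplicate" idea for MySQL (the JDBC URL, table and column names below are made up):

    import java.sql.DriverManager

    // Hypothetical idempotent write: replaying the same partition after a failure
    // simply overwrites the same rows instead of double counting them.
    def upsertCounts(rows: Iterator[(String, Long)]): Unit = {
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost/analytics", "user", "pass")
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO page_counts (page, cnt) VALUES (?, ?) " +
          "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)")
        rows.foreach { case (page, cnt) =>
          stmt.setString(1, page); stmt.setLong(2, cnt); stmt.executeUpdate()
        }
      } finally conn.close()
    }
    // typically invoked as rdd.foreachPartition(upsertCounts _)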

Re: spark.streaming.concurrentJobs

2015-09-25 Thread Atul Kulkarni
Can someone please help, either by explaining or by pointing to documentation, with the relationship between the number of executors needed and how to let the concurrent jobs created by the above parameter run in parallel? On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni wrote: > Hi

how to handle OOMError from groupByKey

2015-09-25 Thread Elango Cheran
Hi everyone, I have an RDD of the format (user: String, timestamp: Long, state: Boolean). My task involves converting the states, where on/off is represented as true/false, into intervals of 'on' of the format (beginTs: Long, endTs: Long). So this task requires me, per user, to line up all of

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
Spark's checkpointing system is not a transactional database, and it doesn't really make sense to try and turn it into one. On Fri, Sep 25, 2015 at 2:15 PM, Radu Brumariu wrote: > Wouldn't the same case be made for checkpointing in general ? > What I am trying to say, is that

java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
Hi, I'm getting a NotSerializableException even though I'm creating all my objects from within the closure: import org.apache.avro.generic.GenericDatumReader import java.io.File import org.apache.avro._ val orig_schema = Schema.parse(new File("/home/wasabi/schema")) val READER = new

Re: --class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
I'm sorry. Both approaches actually work. Something else was wrong with my cluster. Petr On Fri, Sep 25, 2015 at 4:53 PM, Petr Novak wrote: > Either setting it programmatically doesn't work: > sparkConf.setIfMissing("class", "...Main") > > In my current setting moving

RE: hive on spark query error

2015-09-25 Thread Garry Chen
Yes, you are right. I made the change and also linked hive-site.xml into the Spark conf directory. Rerunning the SQL, I get an error in hive.log: 2015-09-25 13:31:14,750 INFO [HiveServer2-Handler-Pool: Thread-125]: client.SparkClientImpl (SparkClientImpl.java:startDriver(375)) - Attempting impersonation of

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-25 Thread Cheng Lian
Thanks for the clarification. Could you please provide the full schema of your table and query plans of your query? You may obtain them via: hiveContext.table("your_table").printSchema() and hiveContext.sql("your query").explain(extended = true) You also mentioned "Thrift" in the subject,

Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Hi, We are using DirectKafkaInputDStream and store completed consumer offsets in Kafka (0.8.2). However, some of our use cases require that offsets not be written if processing of a partition fails with certain exceptions. This allows us to build various backoff strategies for that partition,

Re: Kafka & Spark Streaming

2015-09-25 Thread Petr Novak
You can have offsetRanges on workers f.e. object Something { var offsetRanges = Array[OffsetRange]() def create[F : ClassTag](stream: InputDStream[Array[Byte]]) (implicit codec: Codec[F]: DStream[F] = { stream transform { rdd => offsetRanges =

Convert Vector to RDD[Double]

2015-09-25 Thread Yusuf Can Gürkan
How can i convert a Vector to RDD[Double]. For example: val vector = Vectors.dense(1.0,2.0) val rdd // i need sc.parallelize(Array(1.0,2.0)) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands,

RE: hive on spark query error

2015-09-25 Thread Garry Chen
In spark-defaults.conf the spark.master is spark://hostname:7077. From hive-site.xml spark.master hostname From: Jimmy Xiang [mailto:jxi...@cloudera.com] Sent: Friday, September 25, 2015 1:00 PM To: Garry Chen Cc: user@spark.apache.org Subject: Re: hive on

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Ted Yu
Is the Schema.parse() call expensive ? Can you call it in the closure ? On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm getting a NotSerializableException even though I'm creating all the my > objects from within the closure: > import
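
A sketch of what "call it in the closure" could look like, so nothing Avro-related is captured from the driver (the RDD type, schema path and the use made of the schema are placeholders):

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.spark.rdd.RDD

    // Parse the (non-serializable) Schema once per partition instead of on the driver
    def withSchema(lines: RDD[String]): RDD[String] =
      lines.mapPartitions { iter =>
        val schema = Schema.parse(new File("/home/wasabi/schema"))
        iter.map(line => schema.getName + ":" + line)   // placeholder use of the schema
      }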

Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote: > In spark-defaults.conf the spark.master is spark://hostname:7077. From > hive-site.xml > spark.master > hostname > That's not a valid value for spark.master (as the error indicates). You should set it to

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers also has an example of how to close over the offset ranges so they are available on executors. On Fri, Sep 25, 2015 at 12:50 PM, Neelesh wrote: > Hi, >We are

Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh, Right now, we only allow specific data types defined in the inputSchema. Supporting abstract types (e.g. NumericType) may make the logic of a UDAF more complex. It will be great to understand the use cases first. What kinds of input data types do you want to support, and

Re: Generic DataType in UDAF

2015-09-25 Thread Ritesh Agrawal
Hi Yin, I have written a simple UDAF to generate N samples for each group. I am using the reservoir sampling algorithm for this. In this case the input data type doesn't matter, since I am not doing any kind of processing on the input data but just selecting elements at random and building an array

Broadcast to executors with multiple cores

2015-09-25 Thread Jeff Palmucci
So I have a large data structure that I want to broadcast to my executors. It is so large that it makes sense to share access to the object between multiple tasks, so I create my executors with multiple cores. Unfortunately, it looks like the object is not shared between threads, but is copied

[SPARK-SQL] Requested array size exceeds VM limit

2015-09-25 Thread Sadhan Sood
I am trying to run a query on a month of data. The volume of data is not much, but we have a partition per hour and per day. The table schema is heavily nested with total of 300 leaf fields. I am trying to run a simple select count(*) query on the table and running into this exception: SELECT

GraphX create graph with multiple node attributes

2015-09-25 Thread JJ
Hi, I am new to Spark and GraphX, so thanks in advance for your patience. I want to create a graph with multiple node attributes. Here is my code: But I receive error: Can someone help? Thanks! -- View this message in context:
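
The original code block doesn't appear in this archive view; for reference, a generic sketch (assuming a spark-shell sc) of carrying several vertex attributes in a case class, which is a common way to model "multiple node attributes":

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    case class NodeAttr(name: String, age: Int, city: String)

    val vertices: RDD[(VertexId, NodeAttr)] = sc.parallelize(Seq(
      (1L, NodeAttr("alice", 34, "NYC")),
      (2L, NodeAttr("bob", 28, "SF"))))
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
    val graph: Graph[NodeAttr, String] = Graph(vertices, edges)
    graph.vertices.collect().foreach(println)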

Fwd: Spark for Oracle sample code

2015-09-25 Thread Cui Lin
Hello, All, I found the examples for JDBC connection are mostly read the whole table and then do operations like joining. val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load() Sometimes it is not practical

Spark for Oracle sample code

2015-09-25 Thread Cui Lin
Hello, All, I found the examples for JDBC connection are mostly read the whole table and then do operations like joining. val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load() Sometimes it is not practical

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
Hi sparkers, Anyone knows how to do LATERAL VIEW EXPLODE without HiveContext? I don't want to start up a metastore and derby just because I need LATERAL VIEW EXPLODE. I have been trying but I always get the exception like this: Name: java.lang.RuntimeException Message: [1.68] failure: ``union''

Error in starting sparkR: Error in socketConnection(port = monitorPort) :

2015-09-25 Thread Jonathan Yue
I have been trying to start sparkR but always get the error about monitorPort: export LD_LIBRARY_PATH=$HOME/jaguar/lib:$HOME/opt/hadoop/lib/native JDBCJAR=$HOME/jaguar/lib/jaguar-jdbc-2.0.jar sparkR \ --driver-class-path $JDBCJAR \ --driver-library-path $HOME/jaguar/lib \ --conf

Re: Spark for Oracle sample code

2015-09-25 Thread Jonathan Yue
In your dbtable you can insert "select ..." instead of a table name. I never tried it, but saw an example on the web. Best regards, Jonathan From: Cui Lin To: user Sent: Friday, September 25, 2015 4:12 PM Subject: Spark for Oracle sample
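
For what it's worth, the pattern being described looks roughly like this (connection details, driver and query are made up):

    // A parenthesised subquery with an alias can stand in for the table name,
    // so only the selected rows are pulled from Oracle.
    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:oracle:thin:@//dbhost:1521/orcl",
      "dbtable" -> "(SELECT order_id, amount FROM orders WHERE order_date > SYSDATE - 30) t",
      "driver"  -> "oracle.jdbc.OracleDriver"
    )).load()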

How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Zhiliang Zhu
Hi All, I would like to submit a Spark job from another remote machine outside the cluster. I also copied the hadoop/spark conf files onto the remote machine; a Hadoop job can then be submitted, but a Spark job cannot. In spark-env.sh, it may be because SPARK_LOCAL_IP is not properly set, or

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Michael Armbrust
The SQL parser without HiveContext is really simple, which is why I generally recommend users use HiveContext. However, you can do it with dataframes: import org.apache.spark.sql.functions._ table("purchases").select(explode(df("purchase_items")).as("item")) On Fri, Sep 25, 2015 at 4:21 PM,
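
Expanded slightly, with the import and a hypothetical table schema purchases(purchase_items: array<string>), that looks like:

    import org.apache.spark.sql.functions._

    val purchases = sqlContext.table("purchases")
    val items = purchases.select(explode(purchases("purchase_items")).as("item"))
    items.show()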

Re: Spark for Oracle sample code

2015-09-25 Thread Michael Armbrust
In most cases, predicates that you add to jdbcDF will be pushed down into Oracle, preventing the whole table from being sent over. df.where("column = 1") Another common pattern is to save the table to Parquet or something similar for repeated querying. Michael On Fri, Sep 25, 2015 at 3:13 PM, Cui Lin
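
A minimal sketch of both suggestions (URL and column names as in the earlier example, the Parquet path is illustrative):

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()

    // Simple predicates like this are pushed down to the database,
    // so the whole table is not shipped over the network
    val filtered = jdbcDF.where("column = 1")

    // For repeated querying, snapshot the table to Parquet once and query that instead
    jdbcDF.write.parquet("/data/tablename_snapshot.parquet")
    val cached = sqlContext.read.parquet("/data/tablename_snapshot.parquet")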

How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Zhiliang Zhu
Hi All, I would like to submit a Spark job from another remote machine outside the cluster. I also copied the hadoop/spark conf files onto the remote machine; a Hadoop job can then be submitted, but a Spark job cannot. In spark-env.sh, it may be because SPARK_LOCAL_IP is not properly set, or

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Gavin Yue
Print out your env variables and check first Sent from my iPhone > On Sep 25, 2015, at 18:43, Zhiliang Zhu wrote: > > Hi All, > > I would like to submit spark job on some another remote machine outside the > cluster, > I also copied hadoop/spark conf files under

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Zhiliang Zhu
Hi Yue, Thanks very much for your kind reply. I would like to submit a Spark job remotely from another machine outside the cluster, and the job will run on YARN, similar to how the Hadoop job is already done. Could you confirm this will work exactly like that for Spark... Do you mean that I should print those variables

What is this Input Size in Spark Application Detail UI?

2015-09-25 Thread Chirag Dewan
Hi All, I was wondering what the Input Size in the Application UI means. For my 3-node Cassandra cluster with a 3-node Spark cluster, this size is 32 GB. For my 15-node Cassandra cluster with a 15-node Spark cluster, this size reaches 172 GB, though the data in both clusters is about the same volume.

Re: Stop a Dstream computation

2015-09-25 Thread Sean Owen
Your exception handling is occurring on the driver, where you 'configure' the job. I don't think it's what you mean to do. You probably mean to do this within a function you are executing on data within the cluster, like mapPartitions etc. On Fri, Sep 25, 2015 at 5:20 AM, Samya

Re: Reasonable performance numbers?

2015-09-25 Thread Adrian Tanase
It’s really hard to answer this, as the comparison is not really fair – Storm is much lower level than Spark and has less overhead when dealing with stateless operations. I’d be curious how is your colleague implementing the Average on a “batch” and what is the storm equivalent of a Batch.

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Adrian Tanase
Hi Radu, The problem itself is not checkpointing the data – if your operations are stateless then you are only checkpointing the kafka offsets, you are right. The problem is that you are also checkpointing metadata – including the actual Code and serialized java classes – that’s why you’ll see

Re: Using Spark for portfolio manager app

2015-09-25 Thread Thúy Hằng Lê
Thanks all for the feedback so far. I haven't decided which external storage will be used yet. HBase is cool but it requires Hadoop in production. I only have 3-4 servers for the whole thing (I am thinking of a relational database for this, it can be MariaDB, Memsql or MySQL) but they are hard to

SQLContext.read().json() inferred schema - force type to strings?

2015-09-25 Thread Ewan Leith
Hi all, We're using SQLContext.read.json to read in a stream of JSON datasets, but sometimes the inferred schema contains for the same value a LongType, and sometimes a DoubleType. This obviously causes problems with merging the schema, so does anyone know a way of forcing the inferred

Weird worker usage

2015-09-25 Thread N B
Hello all, I have a Spark streaming application that reads from a Flume Stream, does quite a few maps/filters in addition to a few reduceByKeyAndWindow and join operations before writing the analyzed output to ElasticSearch inside a foreachRDD()... I recently started to run this on a 2 node

Error: Asked to remove non-existent executor

2015-09-25 Thread Tracewski, Lukasz
Hi, I am trying to submit a job on Spark 1.4 (with Spark Master): bin/spark-submit --master spark://:7077 --driver-memory 4g --executor-memory 4G --executor-cores 4 --num-executors 1 spark/examples/src/main/python/pi.py 6 which returns: ERROR SparkDeploySchedulerBackend: Asked to remove

Setting Spark TMP Directory in Cluster Mode

2015-09-25 Thread mufy
Faced with an issue where Spark temp files get filled under /opt/spark-1.2.1/tmp on the local filesystem on the worker nodes. Which parameter/configuration sets that location? The spark-env.sh file has the folders set as, export SPARK_HOME=/opt/spark-1.2.1 export SPARK_WORKER_DIR=$SPARK_HOME/tmp

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Akhil Das
Which version of spark are you having? Can you also check whats set in your conf/spark-defaults.conf file? Thanks Best Regards On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote: > Running Spark app over Yarn 2.7 > > Here is my sparksubmit setting: > --master yarn-cluster

Re: Error: Asked to remove non-existent executor

2015-09-25 Thread Akhil Das
What do you mean by "you are behind a NAT"? Does it mean you are submitting your jobs to a remote Spark cluster from your local machine? If that's the case then you need to take care of a few ports (in the NAT) http://spark.apache.org/docs/latest/configuration.html#networking which assume random as

sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread our...@cnsuning.com
Hi all, when using the JavaSparkSQL example, the code was submitted many times as follows: /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar Unfortunately,

Re: sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread our...@cnsuning.com
https://issues.apache.org/jira/browse/SPARK-10832 From: our...@cnsuning.com Sent: 2015-09-25 20:36 To: user Cc: 494165115 Subject: sometimes No event logs found for application using same JavaSparkSQL example Hi all, when using the JavaSparkSQL example, the code was submitted many times as

Transformation pipeling and parallelism in Spark

2015-09-25 Thread Zhongmiao Li
Hello all, I have a question regarding the pipelining and parallelism of transformations in Spark. I couldn’t find any documentation about it and I would really appreciate your help if you could help me with it. I just started using and reading Spark, so I guess my description may not be very

Re: LogisticRegression models consumes all driver memory

2015-09-25 Thread Eugene Zhulenev
Problem turned out to be in too high 'spark.default.parallelism', BinaryClassificationMetrics are doing combineByKey which internally shuffle train dataset. Lower parallelism + cutting train set RDD history with save/read into parquet solved the problem. Thanks for hint! On Wed, Sep 23, 2015 at
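
For readers hitting the same thing, the "cutting the train set RDD history" step mentioned here amounts to something like the following (illustrative path; assuming the prepared training DataFrame is called trainingDF):

    // Materialise the prepared training set and read it back, so the long transformation
    // lineage is dropped before the iterative training / metrics stages.
    trainingDF.write.mode("overwrite").parquet("/tmp/training_snapshot")
    val training = sqlContext.read.parquet("/tmp/training_snapshot")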

How to set spark envoirnment variable SPARK_LOCAL_IP in conf/spark-env.sh

2015-09-25 Thread Zhiliang Zhu
Hi all, The Spark job will run on YARN. Whether I do not set SPARK_LOCAL_IP at all, or set it as export SPARK_LOCAL_IP=localhost (or as the specific node IP) in the specific Spark install directory, it works well to submit a Spark job on the master node of the cluster; however, it will fail by

Re: Weird worker usage

2015-09-25 Thread Akhil Das
Parallel tasks totally depends on the # of partitions that you are having, if you are not receiving sufficient partitions (partitions > total # cores) then try to do a .repartition. Thanks Best Regards On Fri, Sep 25, 2015 at 1:44 PM, N B wrote: > Hello all, > > I have a

Receiver and Parallelization

2015-09-25 Thread nibiau
Hello, I used a custom receiver in order to receive JMS messages from MQ servers. I want to benefit from the YARN cluster; my questions are: - Is it possible to have only one node receiving JMS messages and parallelize the RDD over all the cluster nodes? - Is it possible to parallelize also the

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Just use the official connector from DataStax https://github.com/datastax/spark-cassandra-connector Your solution is very similar. Let’s assume the state is case class UserState(amount: Int, updates: Seq[Int]) And your user has 100 - If your user does not see an update, you can emit
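
A rough sketch of that pattern with updateStateByKey (the field names and the update rule are illustrative, not taken from the thread):

    import org.apache.spark.streaming.dstream.DStream

    case class UserState(amount: Int, updates: Seq[Int])

    // deltas: (userId, change-in-amount) per batch; requires ssc.checkpoint(...) to be set
    def trackState(deltas: DStream[(String, Int)]): DStream[(String, UserState)] =
      deltas.updateStateByKey[UserState] { (newDeltas: Seq[Int], prev: Option[UserState]) =>
        val state = prev.getOrElse(UserState(0, Nil))
        // With no update in this batch, the current state is simply re-emitted
        Some(UserState(state.amount + newDeltas.sum, newDeltas))
      }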

Spark task error

2015-09-25 Thread madhvi.gupta
Hi, My configuration is as follows: SPARK_EXECUTOR_INSTANCES=4 SPARK_EXECUTOR_MEMORY=1G But my Spark UI shows: Alive Workers: 1, Cores in use: 4 Total (0 Used), Memory in use: 6.7 GB Total (0.0 B Used). Also, while running a program in Java for Spark I am getting the following

Re: Setting Spark TMP Directory in Cluster Mode

2015-09-25 Thread Akhil Das
Try with spark.local.dir in the spark-defaults.conf or SPARK_LOCAL_DIRS in the spark-env.sh file. Thanks Best Regards On Fri, Sep 25, 2015 at 2:14 PM, mufy wrote: > Faced with an issue where Spark temp files get filled under > /opt/spark-1.2.1/tmp on the local filesystem

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Zhiliang Zhu
And the remote machine is not in the same local area network with the cluster . On Friday, September 25, 2015 12:28 PM, Zhiliang Zhu wrote: Hi Zhan, I have done that as your kind help. However, I just could use "hadoop fs -ls/-mkdir/-rm XXX" commands

spark.streaming.concurrentJobs

2015-09-25 Thread Atul Kulkarni
Hi Folks, I am trying to speed up my Spark streaming job. I found a presentation by Tathagata Das that mentions increasing the value of "spark.streaming.concurrentJobs" if I have more than one output. In my Spark streaming job I am reading from Kafka using the receiver-based approach and transforming

Best practices for scheduling Spark jobs on "shared" YARN cluster using Autosys

2015-09-25 Thread unk1102
Hi, I have 5 Spark jobs which need to be run in parallel to speed up the process; they take around 6-8 hours together. I have 93 container nodes with 8 cores each and a memory capacity of around 2.8 TB. Now I run each job with around 30 executors with 2 cores and 20 GB each. Each of my jobs processes around 1

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Re: DB I strongly encourage you to look at Cassandra – it’s almost as powerful as Hbase, a lot easier to setup and manage. Well suited for this type of usecase, with a combination of K/V store and time series data. For the second question, I’ve used this pattern all the time for “flash

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Zhiliang Zhu
Hi Steve, Thanks a lot for your reply. That is, some commands work on the remote server with the gateway installed, but some other commands will not work. As expected, the remote machine is not in the same area network as the cluster, and the cluster's port is forbidden. While I make the remote

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-25 Thread Petr Novak
Many thanks Cody, it explains quite a bit. I had a couple of problems with checkpointing and graceful shutdown moving from working code in Spark 1.3.0 to 1.5.0: InterruptedExceptions, KafkaDirectStream couldn't initialize, and some exceptions regarding the WAL even though I'm using the direct stream. Meanwhile

Re: Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-25 Thread Steve Loughran
On 25 Sep 2015, at 03:35, Zhang, Jingyu > wrote: I got the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); Can anyone help me out? Thanks 15/09/25 12:24:32 INFO SparkContext: Successfully stopped

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Steve Loughran
On 25 Sep 2015, at 05:25, Zhiliang Zhu > wrote: However, I just could use "hadoop fs -ls/-mkdir/-rm XXX" commands to operate at the remote machine with gateway, which means the namenode is reachable; all those commands only

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-25 Thread Uthayan Suthakar
Thank you Tathagata and Therry for your response. You guys were absolutely correct: I created a dummy DStream (to prevent the Flume channel filling up) and counted the messages but didn't output (print) them, which is why it reported that error. Since I called print(), the error is no longer being

Re: Unreachable dead objects permanently retained on heap

2015-09-25 Thread Saurav Sinha
Hi Spark Users, I am running some Spark jobs that run every hour. After running for 12 hours the master is getting killed, giving the exception *java.lang.OutOfMemoryError: GC overhead limit exceeded*. It looks like there is some memory issue in the Spark master. I noticed the same kind of issue with

HDFS is undefined

2015-09-25 Thread Angel Angel
Hello, I am running a Spark application. I have installed Cloudera Manager; it includes Spark version 1.2.0. But now I want to use Spark version 1.4.0, which is also working fine. But when I try to access HDFS in Spark 1.4.0 in Eclipse I am getting the following error. "Exception in

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
1) yes, just use .repartition on the inbound stream, this will shuffle data across your whole cluster and process in parallel as specified. 2) yes, although I’m not sure how to do it for a totally custom receiver. Does this help as a starting point?
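
As a concrete starting point for (1), a minimal sketch (the socket source below is only a stand-in for the custom JMS receiver; names and intervals are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("jms-ingest"), Seconds(10))
    // In the real job this would be ssc.receiverStream(new YourJmsReceiver(...))
    val raw = ssc.socketTextStream("ingest-host", 9999)
    // Shuffle the single receiver's blocks across the cluster, so the heavy
    // processing that follows runs on all workers rather than the ingesting node
    val distributed = raw.repartition(ssc.sparkContext.defaultParallelism)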
