RE: Could Spark batch processing live within Spark Streaming?

2015-06-12 Thread prajod.vettiyattil
Hi Raj, What you need seems to be an event-based initiation of a DStream. I have not seen one yet. There are many types of DStreams that Spark implements. You can also implement your own. InputDStream is a close match for your requirement. See this for the available options with InputDStream:

Re: Reading Really Big File Stream from HDFS

2015-06-12 Thread Saisai Shao
Using sc.textFile will also read the file from HDFS line by line through an iterator; it doesn't need to fit entirely into memory, so even with a small amount of memory it will still work. 2015-06-12 13:19 GMT+08:00 SLiZn Liu sliznmail...@gmail.com: Hmm, you have a good point. So should I load the file
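A minimal sketch of that lazy, line-by-line processing (the HDFS path and filter are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("BigFileScan"))
    // textFile returns an RDD of lines; each task streams its split through an
    // iterator, so the whole file never has to be materialized in memory at once.
    val lines = sc.textFile("hdfs:///data/really-big-file.txt")
    // Aggregations such as count() never collect the raw lines to the driver.
    val numErrors = lines.filter(_.contains("ERROR")).count()
    println(s"error lines: $numErrors")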

Re: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
Thanks for the extra details and explanations Chaocai, will try to reproduce this when I get a chance. Cheng On 6/12/15 3:44 PM, 姜超才 wrote: I said OOM occurred on the slave node because I monitored memory utilization during the query task; on the driver, very little memory was occupied. And I remember I

RE: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark RDD will provide an API to fetch records one by one from the final result set, instead of pulling them all (or whole partition data) into driver memory. Seems like a big change. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To:

Re: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
My guess is that the reason local mode is OK while the standalone cluster doesn't work is that in cluster mode, task results are serialized and sent to the driver side. The driver needs to deserialize the results, and thus occupies much more memory than in local mode (where task result de/serialization is not

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-12 Thread Tathagata Das
Is it a lot of data that is expected to come through stdin? I mean is it even worth parallelizing the computation using something like Spark Streaming? On Thu, Jun 11, 2015 at 9:56 PM, Heath Guo heath...@fb.com wrote: Thanks for your reply! In my use case, it would be stream from only one

Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
Hi Chaocai, Glad that 1.4 fixes your case. However, I'm a bit confused by your last comment saying the OOM or lost heartbeat occurred on the slave node, because from the log files you attached at first, those OOMs actually happen on the driver side (the Thrift server log only contains log lines from

RE: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark Core will provide an API to fetch records one by one from the block manager, instead of pulling them all into driver memory. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To: 姜超才; Hester wang; user@spark.apache.org Subject: Re: 回复:

Re: Spark 1.4 release date

2015-06-12 Thread Todd Nist
It was released yesterday. On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote: Hi When is official spark 1.4 release date? Best Ayan

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Saisai Shao
The Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we need a specific API to handle this. I'm not sure whether there is any simple workaround to fix this; maybe we should think carefully about it. 2015-06-12 13:59 GMT+08:00 Amit Ramesh

If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-12 Thread Haopu Wang
This is a quick question about checkpoints. The question is: if the StreamingContext is not stopped gracefully, will the checkpoint be consistent? Or should I always shut down the application gracefully in order to use the checkpoint? Thank you very much!

Spark 1.4 release date

2015-06-12 Thread ayan guha
Hi When is official spark 1.4 release date? Best Ayan

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-12 Thread Heath Guo
Yes, it is lots of data, and the utility I'm working with prints out an infinite real-time data stream. Thanks. From: Tathagata Das t...@databricks.com Date: Thursday, June 11, 2015 at 11:43 PM To: Heath Guo heath...@fb.com Cc: user

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-12 Thread Gerard Maas
Would using socketTextStream and `yourApp | nc -lk port` work? Not sure how resilient the socket receiver is, though. I've been playing with it for a little demo and I don't yet understand its reconnection behavior. Although I would think that putting some elastic buffer in between would be a
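A minimal sketch of that wiring, assuming the utility is piped into `nc -lk 9999` on the same host:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StdinViaSocket")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Each line written by `yourApp | nc -lk 9999` becomes one record.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()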

Re: BigDecimal problem in parquet file

2015-06-12 Thread Cheng Lian
On 6/10/15 8:53 PM, Bipin Nag wrote: Hi Cheng, I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an existing parquet file, then repartitioning and saving it. Doing this gives the error. The code for this doesn't look like it's causing the problem. I have a feeling the source -

Optimizing Streaming from Websphere MQ

2015-06-12 Thread Chaudhary, Umesh
Hi, I have created a Custom Receiver in Java which receives data from WebSphere MQ, and I am only writing the received records to HDFS. I have referred to many forums for optimizing the speed of a Spark Streaming application. Here I am listing a few: * Spark

How to use Window Operations with kafka Direct-API?

2015-06-12 Thread ZIGEN
Hi, I'm using Spark Streaming (1.3.1). I want to get exactly-once messaging from Kafka and use window operations of DStream. When using window operations (e.g. DStream#reduceByKeyAndWindow) with the Kafka direct API, a java.lang.ClassCastException occurs as follows. --- stacktrace --

Re: Upgrade to parquet 1.6.0

2015-06-12 Thread Cheng Lian
At the time 1.3.x was released, Parquet 1.6.0 hadn't been released yet, and we didn't have enough time to upgrade to and test Parquet 1.6.0 for Spark 1.4.0. But we've already upgraded Parquet to 1.7.0 (which is exactly the same as 1.6.0 with the package name renamed from com.twitter to org.apache.parquet) on

Re: BigDecimal problem in parquet file

2015-06-12 Thread Bipin Nag
Hi Cheng, Yes, some rows contain unit instead of decimal values. I believe some rows from the original source I had don't have any value, i.e. it is null, and that shows up as unit. How does spark-sql or parquet handle null in place of decimal values, assuming that field is nullable? I will have

Re: Spark 1.4 release date

2015-06-12 Thread ayan guha
Thanks a lot. On 12 Jun 2015 19:46, Todd Nist tsind...@gmail.com wrote: It was released yesterday. On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote: Hi When is official spark 1.4 release date? Best Ayan

Re: Spark 1.4 release date

2015-06-12 Thread Guru Medasani
Here is a Spark 1.4 release blog by Databricks: https://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html Guru Medasani gdm...@gmail.com On Jun 12, 2015, at 7:08 AM, ayan guha guha.a...@gmail.com wrote:

Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
Hi What is the reason that Spark still comes with Parquet 1.6.0rc3? It seems like newer Parquet versions are available (e.g. 1.6.0). This would fix problems with ‘spark.sql.parquet.filterPushdown’, which currently is disabled by default, because of a bug in Parquet 1.6.0rc3. Thanks! Eric

Re: Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
Great, thanks for the extra info! On 12 Jun 2015, at 12:41, Cheng Lian lian.cs@gmail.com wrote: At the time 1.3.x was released, 1.6.0 hasn't been released yet. And we didn't have enough time to upgrade and test Parquet 1.6.0 for Spark 1.4.0. But we've already upgraded Parquet to

Re: How to use Window Operations with kafka Direct-API?

2015-06-12 Thread zigen
Hi Shao, Thank you for your prompt reply. I was disappointed. I will try window operations with the Receiver-based approach (KafkaUtils.createStream). Thank you again, ZIGEN On 2015/06/12 at 17:18, Saisai Shao sai.sai.s...@gmail.com wrote: I think you could not use offsetRange in such a way, when you

Scheduling and node affinity

2015-06-12 Thread Brian Candler
I would like to know if Spark has any facility by which particular tasks can be scheduled to run on chosen nodes. The use case: we have a large custom-format database. It is partitioned and the segments are stored on local SSD on multiple nodes. Incoming queries are matched against the

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread algermissen1971
Cody, On 12 Jun 2015, at 17:26, Cody Koeninger c...@koeninger.org wrote: There are several database apis that use a thread local or singleton reference to a connection pool (we use ScalikeJDBC currently, but there are others). You can use mapPartitions earlier in the chain to make sure

Extracting k-means cluster values along with centers?

2015-06-12 Thread Minnow Noir
Greetings. I have been following some of the tutorials online for Spark k-means clustering. I would like to be able to dump all the cluster values and their centroids to a text file so I can explore the data. I have the clusters as such: val clusters = KMeans.train(parsedData, numClusters,
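A minimal sketch of pairing each point with its cluster id and center, assuming parsedData is an RDD[Vector] as in the MLlib tutorials (path, k and iteration count are illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val parsedData = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()
    val clusters = KMeans.train(parsedData, 3, 20)  // k = 3, 20 iterations
    // Pair every point with its cluster id and that cluster's center, then dump to text.
    parsedData.map { v =>
      val id = clusters.predict(v)
      s"$v\t$id\t${clusters.clusterCenters(id)}"
    }.saveAsTextFile("kmeans-dump")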

--jars not working?

2015-06-12 Thread Jonathan Coveney
Spark version is 1.3.0 (will upgrade as soon as we upgrade past Mesos 0.19.0)... Regardless, I'm running into a really weird situation where, when I pass --jars to bin/spark-shell, I can't reference those classes in the REPL. Is this expected? The logs even tell me that my jars have been added, and

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread algermissen1971
On 12 Jun 2015, at 23:19, Cody Koeninger c...@koeninger.org wrote: A. No, it's called once per partition. Usually you have more partitions than executors, so it will end up getting called multiple times per executor. But you can use a lazy val, singleton, etc to make sure the setup only

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Juan Rodríguez Hortalá
Hi, If you want, I would be happy to work on this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted https://github.com/apache/spark/pull/5367. I'm fluent in Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread Cody Koeninger
A. No, it's called once per partition. Usually you have more partitions than executors, so it will end up getting called multiple times per executor. But you can use a lazy val, singleton, etc. to make sure the setup only takes place once per JVM. B. I can't speak to the specifics there ... but

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread Cody Koeninger
Close. The mapPartitions call doesn't need to do anything at all to the iter. mapPartitions { iter => SomeDb.conn.init; iter } On Fri, Jun 12, 2015 at 3:55 PM, algermissen1971 algermissen1...@icloud.com wrote: Cody, On 12 Jun 2015, at 17:26, Cody Koeninger c...@koeninger.org wrote:
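A slightly fuller sketch of the pattern (the SomeDb object, JDBC URL and credentials are illustrative; the lazy val guarantees at most one initialization per executor JVM):

    import org.apache.spark.rdd.RDD

    object SomeDb {
      // Initialized on first access on each executor, then reused.
      lazy val conn: java.sql.Connection =
        java.sql.DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "pass")
    }

    def withConnectionInit[T](rdd: RDD[T]): RDD[T] =
      rdd.mapPartitions { iter =>
        SomeDb.conn  // touching the lazy val forces the one-time setup on this executor
        iter         // the partition's iterator is passed through untouched
      }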

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread algermissen1971
On 12 Jun 2015, at 22:59, Cody Koeninger c...@koeninger.org wrote: Close. The mapPartitions call doesn't need to do anything at all to the iter. mapPartitions { iter => SomeDb.conn.init; iter } Yes, thanks! Maybe you can confirm two more things and then you helped me make a giant

How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...

And we have a
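A hedged sketch of one way to do the column selection and per-file row limit (column indices, paths and the limit are illustrative; wholeTextFiles keeps file boundaries but materializes each file in a single task, so it only suits files that fit in a task's memory):

    // Column selection alone is straightforward with textFile:
    val rows = sc.textFile("/data/csv-folder/*")
      .map(_.split("\t"))
      .map(cols => (cols(0), cols(1), cols(3)))  // e.g. keep id, name, city

    // "Top M rows per file" needs per-file grouping, which wholeTextFiles preserves:
    val topM = sc.wholeTextFiles("/data/csv-folder/*")
      .flatMap { case (path, content) => content.split("\n").iterator.take(100) }
    topM.saveAsTextFile("/data/filtered-output")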

Re: [Spark-1.4.0] NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-12 Thread Tao Li
Has anyone met the same problem as me? 2015-06-12 23:40 GMT+08:00 Tao Li litao.bupt...@gmail.com: Hi all: I compiled the new spark 1.4.0 version today. But when I run the WordCount demo, it throws NoSuchMethodError *java.lang.NoSuchMethodError:

Re: how to use a properties file from a url in spark-submit

2015-06-12 Thread Gary Ogden
It turns out one of the other developers wrapped the jobs in a script and did a cd to another folder in the script before executing spark-submit. On 12 June 2015 at 14:20, Matthew Jones mle...@gmail.com wrote: Hmm, either spark-submit isn't picking up the relative path or Chronos is not setting your

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-12 Thread Akhil Das
Looks like your Spark is not able to pick up the HADOOP_CONF. To fix this, you can actually add jets3t-0.9.0.jar to the classpath (sc.addJar("/path/to/jets3t-0.9.0.jar")). Thanks Best Regards On Thu, Jun 11, 2015 at 6:44 PM, shahab shahab.mok...@gmail.com wrote: Hi, I tried to read a csv file

Re: spark stream and spark sql with data warehouse

2015-06-12 Thread Akhil Das
This is a good start, if you haven't read it already: http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations Thanks Best Regards On Thu, Jun 11, 2015 at 8:17 PM, 唐思成 jadetan...@qq.com wrote: Hi all: We are trying to use spark to do some real

[Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-12 Thread Peter Haumer
Hello. I used to be able to run and debug my Spark apps in Eclipse for Spark 1.3.1 by creating a launch and setting the VM var -Dspark.master=local[4]. I am now playing with the new 1.4 and trying out some of my simple samples, which all fail with the same exception as shown below. Running them with

Re: Limit Spark Shuffle Disk Usage

2015-06-12 Thread Akhil Das
You can disable shuffle spill (spark.shuffle.spill http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior) if you have enough memory to hold that much data. I believe adding more resources would be your only choice. Thanks Best Regards On Thu, Jun 11, 2015 at 9:46 PM, Al M
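A minimal sketch of setting that flag programmatically (only safe if the shuffle data really fits in memory; spark.local.dir is shown as an illustrative alternative for pointing spills at a larger disk):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("NoSpillJob")
      .set("spark.shuffle.spill", "false")         // keep shuffle data in memory
      .set("spark.local.dir", "/mnt/big-scratch")  // or redirect spills to a bigger disk
    val sc = new SparkContext(conf)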

Re: --jars not working?

2015-06-12 Thread Akhil Das
You can verify if the jars are shipped properly by looking at the driver UI (running on 4040) Environment tab. Thanks Best Regards On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney jcove...@gmail.com wrote: Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos 0.19.0)...

Re: takeSample() results in two stages

2015-06-12 Thread Imran Rashid
It launches two jobs because it doesn't know ahead of time how big your RDD is, so it doesn't know what the sampling rate should be. After counting all the records, it can determine what the sampling rate should be -- then it does another pass through the data, sampling by the rate it's just

Re: Optimizing Streaming from Websphere MQ

2015-06-12 Thread Akhil Das
How many cores are you allocating for your job? And how many receivers do you have? It would be good if you could post your custom receiver code; it will help people understand it better and shed some light. Thanks Best Regards On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh

Re: Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-12 Thread Tathagata Das
BTW, in Spark 1.4, announced today, I added SQLContext.getOrCreate, so you don't need to create the singleton yourself. On Wed, Jun 10, 2015 at 3:21 AM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Note: CCing user@spark.apache.org First, you must check if the RDD is empty:
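A minimal sketch of using it from inside a streaming job (the DStream, column names and temp table are illustrative):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    def save(words: DStream[String]): Unit =
      words.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // Returns the existing singleton SQLContext, or creates one on first use.
          val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
          import sqlContext.implicits._
          rdd.map(w => (w, 1)).toDF("word", "count").registerTempTable("words")
        }
      }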

Parquet Multiple Output

2015-06-12 Thread Xin Liu
Hi, I have a scenario where I'd like to store an RDD using the parquet format in many files, which correspond to days, such as 2015/01/01, 2015/02/02, etc. So far I used this method http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job to store text files

Re: Spark SQL and Skewed Joins

2015-06-12 Thread Michael Armbrust
2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is available, would any of the JOIN enhancements help this situation? I would try Spark 1.4 after running SET spark.sql.planner.sortMergeJoin=true.
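A minimal sketch of issuing that setting before the join (sqlContext is the existing SQLContext/HiveContext; table and column names are illustrative):

    // Enable sort-merge join planning (Spark 1.4), then run the skewed join.
    sqlContext.sql("SET spark.sql.planner.sortMergeJoin=true")
    val joined = sqlContext.sql(
      "SELECT f.id, d.name FROM fact f JOIN dim d ON f.dim_id = d.id")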

Are there ways to restrict what parameters users can set for a Spark job?

2015-06-12 Thread YaoPau
For example, Hive lets you set a whole bunch of parameters (# of reducers, # of mappers, size of reducers, cache size, max memory to use for a join), while Impala gives users a much smaller subset of parameters to work with, which makes it nice to give to a BI team. Is there a way to restrict

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
Hi Andrew, Thanks a lot! Indeed, it doesn't start with spark; the following properties are read by the driver implementation rather than Spark conf: --conf spooky.root=s3n://spooky- \ --conf spooky.checkpoint=s3://spooky-checkpoint \ This used to work from Spark 1.0.0 to 1.3.1. Do you know

Resource allocation configurations for Spark on Yarn

2015-06-12 Thread Jim Green
Hi Team, Sharing one article which summarize the Resource allocation configurations for Spark on Yarn: Resource allocation configurations for Spark on Yarn http://www.openkb.info/2015/06/resource-allocation-configurations-for.html -- Thanks, www.openkb.info (Open KnowledgeBase for

Dynamic allocator requests -1 executors

2015-06-12 Thread Patrick Woody
Hey all, I've recently run into an issue where spark dynamicAllocation has asked for -1 executors from YARN. Unfortunately, this raises an exception that kills the executor-allocation thread and the application can't request more resources. Has anyone seen this before? It is spurious and the

[Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-12 Thread Rex X
Hi, I want to use spark to select N columns, top M rows of all csv files under a folder. To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Ted Yu
This is the SPARK JIRA which introduced the warning: [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng rhw...@gmail.com wrote: Hi Andrew, Thanks a lot! Indeed, it doesn't start with spark, the

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Amit Ramesh
Hi Juan, I have created a ticket for this: https://issues.apache.org/jira/browse/SPARK-8337 Thanks! Amit On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, If you want I would be happy to work in this. I have worked with

Re: NullPointerException with functions.rand()

2015-06-12 Thread Ted Yu
Created PR and verified the example given by Justin works with the change: https://github.com/apache/spark/pull/6793 Cheers On Wed, Jun 10, 2015 at 7:15 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the NPE came from this line: @transient protected lazy val rng = new XORShiftRandom(seed

Re: Parquet Multiple Output

2015-06-12 Thread Cheng Lian
Spark 1.4 supports dynamic partitioning; you can first convert your RDD to a DataFrame and then save the contents partitioned by the date column. Say you have a DataFrame df containing three columns a, b, and c, you may have something like this: df.write.partitionBy(a,
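A hedged sketch of what the full call might look like with the Spark 1.4 DataFrameWriter API (partition column name and output path are illustrative):

    // Each distinct value of the partition column becomes its own directory,
    // e.g. .../output/date=2015-01-01/part-*.parquet
    df.write
      .partitionBy("date")
      .parquet("hdfs:///warehouse/output")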

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
Thanks all for your information. Andrew, I dug out one of your old posts which is relevant: http://apache-spark-user-list.1001560.n3.nabble.com/little-confused-about-SPARK-JAVA-OPTS-alternatives-td5798.html But it didn't mention how to supply the properties that don't start with spark. On 12 June

Spark SQL and Skewed Joins

2015-06-12 Thread Jon Walton
Greetings, I am trying to implement a classic star schema ETL pipeline using Spark SQL 1.2.1. I am running into problems with shuffle joins for those dimension tables which have skewed keys and are too large for Spark to broadcast them. I have a few questions: 1. Can I split my queries so a

Reliable SQS Receiver for Spark Streaming

2015-06-12 Thread Michal Čizmazia
I would like to have a Spark Streaming SQS Receiver which deletes SQS messages only after they were successfully stored on S3. For this a Custom Receiver can be implemented with the semantics of the Reliable Receiver. The store(multiple-records) call blocks until the given records have been
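A hedged sketch of such a receiver (SqsMessage, fetchBatch and deleteBatch are hypothetical stand-ins for the real SQS client calls; the point is the ordering: the blocking multi-record store() first, the delete only afterwards):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.collection.mutable.ArrayBuffer

    case class SqsMessage(body: String, receiptHandle: String)

    class ReliableSqsReceiver(queueUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        new Thread("sqs-receiver") { override def run(): Unit = receive() }.start()
      }

      def onStop(): Unit = {}

      private def receive(): Unit = {
        while (!isStopped()) {
          val batch = fetchBatch(queueUrl)                      // assumed helper
          if (batch.nonEmpty) {
            // store(multiple-records) blocks until the records are reliably stored,
            // so deleting them from SQS afterwards cannot lose data.
            store(ArrayBuffer(batch.map(_.body): _*))
            deleteBatch(queueUrl, batch.map(_.receiptHandle))   // assumed helper
          }
        }
      }

      private def fetchBatch(url: String): Seq[SqsMessage] = Seq.empty        // placeholder
      private def deleteBatch(url: String, handles: Seq[String]): Unit = ()   // placeholder
    }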

Re: BigDecimal problem in parquet file

2015-06-12 Thread Davies Liu
Maybe it's related to a bug which was fixed recently by https://github.com/apache/spark/pull/6558. On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag bipin@gmail.com wrote: Hi Cheng, Yes, some rows contain unit instead of decimal values. I believe some rows from original source I had don't have

Broadcast value

2015-06-12 Thread Yasemin Kaya
Hi, I am taking a broadcast value from a file. I want to use it to create a Rating object (ALS), but I am getting null. Here is my code https://gist.github.com/yaseminn/d6afd0263f6db6ea4ec5 : lines 17 and 18 are OK, but line 19 returns null, so line 21 gives me an error. I don't understand why. Do you have any idea

Writing data to hbase using Sparkstreaming

2015-06-12 Thread Vamshi Krishna
Hi, I am trying to write data that is produced from the Kafka command-line producer for some topic. I am facing a problem and unable to proceed. Below is my code, which I am building into a jar and running through spark-submit. Am I doing something wrong inside foreachRDD()? What is wrong with

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-12 Thread gaurav sharma
Thanks Todd, that solved my problem. Regards, Gaurav (please excuse spelling mistakes) Sent from phone On Jun 11, 2015 6:42 PM, Todd Nist tsind...@gmail.com wrote: Hi Gaurav, Seems like you could use a broadcast variable for this if I understand your use case. Create it in the driver based
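A minimal sketch of the broadcast-variable pattern being suggested (the lookup map and path are illustrative):

    // Build the value once on the driver, broadcast it, and read it on executors
    // via .value instead of shipping it inside every task closure.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
    val lookupBc = sc.broadcast(lookup)

    val resolved = sc.textFile("hdfs:///data/events.txt")
      .map(line => lookupBc.value.getOrElse(line.trim, -1))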

Re: how to use a properties file from a url in spark-submit

2015-06-12 Thread Gary Ogden
That's a great idea. I did what you suggested and added the url to the props file in the uri of the json. The properties file now shows up in the sandbox. But when it goes to run spark-submit with --properties-file props.properties it fails to find it: Exception in thread main

Re: ClassCastException: BlockManagerId cannot be cast to [B

2015-06-12 Thread davidkl
Hello, Just in case someone finds the same issue: it was caused by running the streaming app with a different version of the cluster jars (the uber jar contained both yarn and spark). Regards J

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-12 Thread Steve Loughran
These are both really good posts: you should try to get them into the documentation. With anything implementing dynamic behavior, there are some fun problems: (a) detecting the delays in the workflow. There are some good ideas here. (b) deciding where to address it. That means you need to monitor the

Re: How to use Window Operations with kafka Direct-API?

2015-06-12 Thread Cody Koeninger
The other thing to keep in mind about Spark window operations against Kafka is that Spark Streaming is based on the current system clock, not the time embedded in your messages. So you're going to get a fundamentally wrong answer from a window operation after a failure / restart, regardless of

RE: How to set spark master URL to contain domain name?

2015-06-12 Thread Wang, Ningjun (LNG-NPV)
I think the problem is that in my local etc/hosts file, I have 10.196.116.95 WIN02 I will remove it and try. Thanks for the help. Ningjun From: prajod.vettiyat...@wipro.com [mailto:prajod.vettiyat...@wipro.com] Sent: Friday, June 12, 2015 1:44 AM To: Wang, Ningjun (LNG-NPV) Cc:

Spark Streaming - Can i BIND Spark Executor to Kafka Partition Leader

2015-06-12 Thread gaurav sharma
Hi, I am using a Kafka + Spark cluster for a real-time aggregation analytics use case in production. *Cluster details* *6 nodes*, each node running 1 Spark and 1 Kafka process. Node 1 - 1 Master, 1 Worker, 1 Driver, 1 Kafka process Node 2,3,4,5,6 - 1 Worker process each

Issues with `when` in Column class

2015-06-12 Thread Chris Freeman
I’m trying to iterate through a list of Columns and create new Columns based on a condition. However, the when method keeps giving me errors that don’t quite make sense. If I do `when(col === “abc”, 1).otherwise(0)` I get the following error at compile time: [error] not found: value when

Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread algermissen1971
Hi, I have a scenario with spark streaming, where I need to write to a database from within updateStateByKey[1]. That means that inside my update function I need a connection. I have so far understood that I should create a new (lazy) connection for every partition. But since I am not working

[Spark-1.4.0]jackson-databind conflict?

2015-06-12 Thread Earthson
I'm using Play 2.4 with play-json 2.4. It works fine with spark-1.3.1, but it failed after I upgraded Spark to spark-1.4.0 :( sc.parallelize(1 to 1).count [info] com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-12 Thread Sean Owen
I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there's something that can be done to work around this in Spark or elsewhere, but for now, harmonize your Jackson version at 2.4.x if you can. On Fri, Jun 12, 2015 at 4:20

Re: Issues with `when` in Column class

2015-06-12 Thread Yin Huai
Hi Chris, Have you imported org.apache.spark.sql.functions._? Thanks, Yin On Fri, Jun 12, 2015 at 8:05 AM, Chris Freeman cfree...@alteryx.com wrote: I’m trying to iterate through a list of Columns and create new Columns based on a condition. However, the when method keeps giving me errors
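A minimal sketch of the fix (the column and values are illustrative):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.when

    // `when` is a standalone function in org.apache.spark.sql.functions,
    // not a method on Column, so it must be imported explicitly.
    def flag(col: Column): Column = when(col === "abc", 1).otherwise(0)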

[Spark-1.4.0] NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-12 Thread Tao Li
Hi all: I compiled the new Spark 1.4.0 version today. But when I run the WordCount demo, it throws NoSuchMethodError *java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer*. I found the default *fasterxml.jackson.version* is *2.4.4*. Is there anything wrong with the

Apache Spark architecture

2015-06-12 Thread Vitalii Duk
Trying to find complete documentation about the internal architecture of Apache Spark, but have had no luck. For example, I'm trying to understand the following: assume that we have a 1Tb text file on HDFS (3 nodes in a cluster, replication factor is 1). This file will be split into 128Mb

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy DatabaseConnection

2015-06-12 Thread Cody Koeninger
There are several database apis that use a thread local or singleton reference to a connection pool (we use ScalikeJDBC currently, but there are others). You can use mapPartitions earlier in the chain to make sure the connection pool is set up on that executor, then use it inside updateStateByKey

Re: Issues with `when` in Column class

2015-06-12 Thread Chris Freeman
That did it! Thanks! From: Yin Huai Date: Friday, June 12, 2015 at 10:31 AM To: Chris Freeman Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: Issues with `when` in Column class Hi Chris, Have you imported org.apache.spark.sql.functions._? Thanks, Yin On Fri, Jun 12, 2015

Re: how to use a properties file from a url in spark-submit

2015-06-12 Thread Matthew Jones
Hmm either spark-submit isn't picking up the relative path or Chronos is not setting your working directory to your sandbox. Try using cd $MESOS_SANDBOX spark-submit --properties-file props.properties On Fri, Jun 12, 2015 at 12:32 PM Gary Ogden gog...@gmail.com wrote: That's a great idea. I

Re: Is it possible to see Spark jobs on MapReduce job history ? (running Spark on YARN cluster)

2015-06-12 Thread Steve Loughran
For that you need SPARK-1537 and the patch to go with it. It is still the Spark web UI; it just hands off storage and retrieval of the history to the underlying YARN timeline server, rather than going through the filesystem. You'll get to see things as they go along too. If you do want to try it,

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Cody Koeninger
The scala api has 2 ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the kafka MessageAndMetadata, including offset. I don't know why the python api was developed with only one way to call createDirectStream, but the first thing I'd
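A hedged sketch of the Scala overload being described, which takes a message handler with access to offsets (broker address, topic and starting offsets are illustrative; ssc is an existing StreamingContext):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)

    // The handler sees the full MessageAndMetadata, including each record's offset.
    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.topic, mmd.partition, mmd.offset, mmd.message())

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)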

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
Sent from my phone On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys Using Hive and Impala daily intensively. Want to transition to spark-sql in CLI mode Currently in my sandbox I am using the Spark (standalone mode) in the CDH

Re: Optimisation advice for Avro-Parquet merge job

2015-06-12 Thread James Aley
Hey Kiran, Thanks very much for the response. I left for vacation before I could try this out, but I'll experiment once I get back and let you know how it goes. Thanks! James. On 8 June 2015 at 12:34, kiran lonikar loni...@gmail.com wrote: It turns out my assumption on load and unionAll

Re: Spark 1.4 release date

2015-06-12 Thread ayan guha
Thanks guys, my question must look like a stupid one today :) Looking forward to testing out 1.4.0, just downloaded it. Congrats to the team on this much anticipated release. On Fri, Jun 12, 2015 at 10:12 PM, Guru Medasani gdm...@gmail.com wrote: Here is a Spark 1.4 release blog by Databricks.

Re: Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Sean Owen
You don't add dependencies to your app -- you mark Spark as 'provided' in the build and you rely on the deployed Spark environment to provide it. On Fri, Jun 12, 2015 at 7:14 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, We want to integrate Spark in our Java application using the
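A minimal build.sbt sketch of that setup (versions are illustrative):

    // Compile against Spark, but let the deployed cluster supply the jars at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
    )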

[Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
In Spark 1.3.x, a system property of the driver could be set by the --conf option, shared between setting spark properties and system properties. In Spark 1.4.0 this feature is removed; the driver instead logs the following warning: Warning: Ignoring non-spark config property: xxx.xxx=v How do

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Andrew Or
Hi Peng, Setting properties through --conf should still work in Spark 1.4. From the warning it looks like the config you are trying to set does not start with the prefix spark.. What is the config that you are trying to set? -Andrew 2015-06-12 11:17 GMT-07:00 Peng Cheng pc...@uow.edu.au: In

Re: Exception when using CLUSTER BY or ORDER BY

2015-06-12 Thread Reynold Xin
Tom, Can you file a JIRA and attach a small reproducible test case if possible? On Tue, May 19, 2015 at 1:50 PM, Thomas Dudziak tom...@gmail.com wrote: Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext

Re: Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Elkhan Dadashov
Thanks for the prompt response, Sean. The issue is that we are restricted on the dependencies we can include in our project. There are 2 issues while including dependencies: 1) there are several dependencies which both we and Spark have, but the versions conflict. 2) there are dependencies Spark has,

Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Elkhan Dadashov
Hi all, We want to integrate Spark into our Java application using the Spark Java API and then run it on YARN clusters. If I want to run Spark on YARN, which dependencies are a must to include? I looked at the Spark POM

log4j configuration ignored for some classes only

2015-06-12 Thread lomax0...@gmail.com
Hi all, I am running spark standalone (local[*]), and have tried to cut back on some of the logging noise from the framework by editing log4j.properties in spark/conf. The following lines are working as expected: log4j.logger.org.apache.spark=WARN

Fwd: Spark/PySpark errors on mysterious missing /tmp file

2015-06-12 Thread John Berryman
(This question is also present on StackOverflow http://stackoverflow.com/questions/30656083/spark-pyspark-errors-on-mysterious-missing-tmp-file ) I'm having issues with pyspark and a missing /tmp file. I've narrowed down the behavior to a short snippet. a=sc.parallelize([(16646160,1)]) #

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
It sounds like this might be caused by a memory configuration problem. In addition to looking at the executor memory, I'd also bump up the driver memory, since it appears that your shell is running out of memory when collecting a large query result. Sent from my phone On Jun 11, 2015, at