Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Cody Koeninger
Is there a reason not to just use Scala? It's not a lot of code... and it'll be even less code in Scala ;) On Wed, Aug 19, 2015 at 4:14 AM, Shushant Arora shushantaror...@gmail.com wrote: To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R] but Option&lt;KafkaRDD[K, V, U, T, R]&gt; is

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Zoltán Zvara
I personally build with SBT and run Spark on YARN with IntelliJ. You need to connect to the remote JVMs with a remote debugger. You need to do something similar if you use Python, because it will launch a JVM on the driver as well. On Wed, Aug 19, 2015 at 2:10 PM canan chen ccn...@gmail.com wrote:
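
A hedged sketch of the remote-debugger setup being described; the class, jar and port are illustrative, not from the thread:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
      --class com.example.MyApp my-app.jar

Then connect an IntelliJ "Remote" run configuration to port 5005 on the driver host.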

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
Hi, thank you all for the assistance. It is odd: it works when creating a new java.math.BigDecimal object, but not if I work directly with scala&gt; 5 match { case x: java.math.BigDecimal =&gt; 2 } &lt;console&gt;:23: error: scrutinee is incompatible with pattern type; found : java.math.BigDecimal required:

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
It is okay. This methodology works very well for mapping objects of my Seq[Any]. It is indeed very cool :-) Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, August 19, 2015 10:47 AM To: Ellafi, Saif A. Cc: sujitatgt...@gmail.com; wrbri...@gmail.com; user@spark.apache.org Subject:

Re: What is the reason for ExecutorLostFailure?

2015-08-19 Thread VIJAYAKUMAR JAWAHARLAL
Hints are good. Thanks Corey. I will try to find out more from the logs. On Aug 18, 2015, at 7:23 PM, Corey Nolet cjno...@gmail.com wrote: Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
Thanks Ted. I noticed another thread about running Spark programmatically (client mode for standalone and YARN). Would it be much easier to debug Spark if that is possible? Has anyone thought about it? On Wed, Aug 19, 2015 at 5:50 PM, Ted Yu yuzhih...@gmail.com wrote: See this thread:

Re: Scala: How to match a java object????

2015-08-19 Thread Ted Yu
Saif: In your example below, the error was due to the fact that there is no automatic conversion from Int to BigDecimal. Cheers On Aug 19, 2015, at 6:40 AM, saif.a.ell...@wellsfargo.com saif.a.ell...@wellsfargo.com wrote: Hi, thank you all for the assistance. It is odd, it works when creating a
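
A minimal sketch of the distinction, assuming a scrutinee typed Any (illustrative, not the poster's exact code):

    val v: Any = new java.math.BigDecimal("5")
    v match {
      case x: java.math.BigDecimal => 2  // compiles: an Any may be a BigDecimal
      case _ => 0
    }
    // 5 match { case x: java.math.BigDecimal => 2 }  does not compile:
    // the scrutinee is an Int, which can never be a java.math.BigDecimal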

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
Does anyone know about this? Or am I missing something here? On Fri, Aug 7, 2015 at 4:20 PM, canan chen ccn...@gmail.com wrote: Is there any reason that the history server uses another property for the event log dir? Thanks

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Muhammad Haseeb Javed
Thanks Andrew for a detailed response. So the reason why key-value pairs with the same key are always found in a single bucket in hash-based shuffle, but not in sort-based, is that in sort-based shuffle each mapper writes a single partitioned file, and it is up to the reducer to fetch the correct partitions from

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
any differences in number of cores, memory settings for executors? On 19 August 2015 at 09:49, Rick Moritz rah...@gmail.com wrote: Dear list, I am observing a very strange difference in behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin

Re: Spark return key value pair

2015-08-19 Thread Dawid Wysakowicz
I am not 100% sure, but flatMap probably unwinds the tuples. Try map instead. 2015-08-19 13:10 GMT+02:00 Jerry OELoo oylje...@gmail.com: Hi. I want to parse a file and return a key-value pair with pySpark, but the result is strange to me. test.sql is a big file and each line is a username
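
A small Scala sketch of the difference (the same semantics apply in pySpark; data is illustrative):

    val lines = sc.parallelize(Seq("alice#secret", "bob#hunter2"))
    val pairs = lines.map { l => val p = l.split("#"); (p(0), p(1)) }  // RDD of 2 tuples
    val flat  = lines.flatMap(_.split("#"))                            // RDD of 4 strings
    pairs.count()  // 2 -- one (username, password) pair per line
    flat.count()   // 4 -- flatMap unwinds each line into its elements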

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread William Briggs
When submitting to YARN, you can specify two different operation modes for the driver with the --master parameter: yarn-client or yarn-cluster. For more information on submitting to YARN, see this page in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html yarn-cluster mode
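
For example, a minimal submission in each mode (the class and jar names are placeholders):

    spark-submit --master yarn-client  --class com.example.MyDriver my-app.jar   # driver runs locally
    spark-submit --master yarn-cluster --class com.example.MyDriver my-app.jar   # driver runs inside the YARN AM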

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
I would compare the Spark UI metrics for both cases and look for any differences (number of partitions, number of spills, etc.). Why can't you make the REPL consistent with the Zeppelin Spark version? The RC might have issues... On 19 August 2015 at 14:42, Rick Moritz rah...@gmail.com wrote: No, the setup

Fwd: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
oops, forgot to reply-all on this thread. -- Forwarded message -- From: Rick Moritz rah...@gmail.com Date: Wed, Aug 19, 2015 at 2:46 PM Subject: Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell To: Igor Berman igor.ber...@gmail.com Those values are not

RE: Failed to fetch block error

2015-08-19 Thread java8964
From the log, it looks like the OS user who is running Spark cannot open any more files. Check the ulimit setting for that user with ulimit -a, e.g.: open files (-n) 65536 Date: Tue, 18 Aug 2015 22:06:04 -0700 From: swethakasire...@gmail.com To: user@spark.apache.org Subject: Failed

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
No, the setup is one driver with 32g of memory, and three executors each with 8g of memory in both cases. No core-number has been specified, thus it should default to single-core (though I've seen the yarn-owned jvms wrapping the executors take up to 3 cores in top). That is, unless, as I

blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Todd
Hi, I would like to ask if there are blogs/articles/videos on how to analyse Spark performance at runtime, e.g., tools that can be used or anything related.

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
That worked great, thanks Andrew. On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or and...@databricks.com wrote: Hi Axel, You can try setting `spark.deploy.spreadOut` to false (through your conf/spark-defaults.conf file). What this does is essentially try to schedule as many cores on one worker as
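
For reference, the setting goes in conf/spark-defaults.conf; the value shown is the non-default behavior discussed here:

    spark.deploy.spreadOut   false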

Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying to create a table and persist the metadata of the table in metastore, which is only supported by HiveContext. On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan yu...@useinsider.com wrote: Hello, I’m trying to create a
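
A minimal sketch of the switch, assuming an existing SparkContext named sc:

    import org.apache.spark.sql.hive.HiveContext
    val sqlContext = new HiveContext(sc)  // persists table metadata in the metastore
    sqlContext.sql("CREATE TABLE t (id INT)")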

Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-19 Thread jlg
Some background on what we're trying to do: We have four Kinesis receivers with varying amounts of data coming through them. Ultimately we work on a unioned stream that is getting about 11 MB/second of data. We use a batch size of 5 seconds. We create four distinct DStreams from this data that

Re: SQLContext Create Table Problem

2015-08-19 Thread Yusuf Can Gürkan
Hey Yin, Thanks for the answer. I thought that this could be the problem, but I cannot create HiveContext because I cannot import org.apache.spark.sql.hive.HiveContext. It does not see this package. I read that I should build Spark with -Phive, but I'm running on Amazon EMR 1.4.1 and on spark-shell

How to overwrite partition when writing Parquet?

2015-08-19 Thread Romi Kuntsman
Hello, I have a DataFrame with a date column which I want to use as a partition. Each day I want to write the data for the same date in Parquet, and then read a dataframe for a date range. I'm using: myDataframe.write().partitionBy("date").mode(SaveMode.Overwrite).parquet(parquetDir); If I use
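
One commonly suggested workaround is to write each day's DataFrame straight to its partition directory, so that Overwrite only replaces that directory; a hedged sketch (dayDf, parquetDir and date are illustrative names):

    import org.apache.spark.sql.SaveMode
    // Overwrites only this date's directory, leaving the other partitions intact
    dayDf.write.mode(SaveMode.Overwrite).parquet(s"$parquetDir/date=$date")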

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Gourav Sengupta
Excellent resource: http://www.oreilly.com/pub/e/3330 Even more amazing is the fact that the presenter actually responds to your questions. Regards, Gourav Sengupta On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote: Hi, I would like to ask if there are blogs/articles/videos on how to

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
hmm, maybe I spoke too soon. I have an Apache Zeppelin instance running and have configured it to use 48 cores (each node only has 16 cores), so I figured that setting it to 48 would mean that Spark would grab 3 nodes. What happens instead, though, is that Spark reports that 48 cores are being

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
You don't need to register; search YouTube for this video... On 19 August 2015 at 18:34, Gourav Sengupta gourav.sengu...@gmail.com wrote: Excellent resource: http://www.oreilly.com/pub/e/3330 Even more amazing is the fact that the presenter actually responds to your questions. Regards,

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
My question was how to do this in Hadoop. Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Of course, Java or Scala can do that: 1) Create a FileWriter with an append or roll-over option 2) For each RDD create a StringBuilder

SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi. I'd like to read Avro files using this library: https://github.com/databricks/spark-avro I need to load several files from a folder, not all of the files. Is there some functionality to filter which files to load? And... Is it possible to know the names of the files loaded from a folder? My problem
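
spark-avro accepts Hadoop glob patterns in the load path, which covers simple filename filtering; a sketch (the pattern is illustrative):

    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/avro/part-2015-08-1*.avro")  // only matching files are read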

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-19 Thread Marcin Kuthan
I'm glad that I could help :) On 19 Aug 2015 8:52 AM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: +1 I wish I had read this blog earlier. I am using Java and have just implemented a singleton producer per executor/JVM during the day. Yes, I did see that NonSerializableException when

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
I'm getting a Spark exception. Please look at this log trace (http://pastebin.com/xL9jaRUa). Thanks, https://in.linkedin.com/in/ramkumarcs31 On Wed, Aug 19, 2015 at 10:20 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: It looks like you are having

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
Will try Scala. The only reason for not using Scala is that I've never worked with it. On Wed, Aug 19, 2015 at 7:34 PM, Cody Koeninger c...@koeninger.org wrote: Is there a reason not to just use Scala? It's not a lot of code... and it'll be even less code in Scala ;) On Wed, Aug 19, 2015 at 4:14 AM,

PySpark on Mesos - Scaling

2015-08-19 Thread scox
I'm running a pyspark program on a Mesos cluster and seeing the following behavior: * The first three stages run with multiple (10-15) tasks. * The fourth stage runs with only one task. * It is using 10 CPUs, which is 5 machines in this configuration. * It is very slow. I would like it to use more

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
Hi Axel, what spark version are you using? Also, what do your configurations look like for the following? - spark.cores.max (also --total-executor-cores) - spark.executor.cores (also --executor-cores) 2015-08-19 9:27 GMT-07:00 Axel Dahl a...@whisperstream.com: hmm maybe I spoke too soon. I

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-19 Thread Romi Kuntsman
If you create a PairRDD from the DataFrame, using dataFrame.toRDD().mapToPair(), then you can call partitionBy(someCustomPartitioner), which will partition the RDD by the key (of the pair). The operations on it (like joining with another RDD) will then take this partitioning into account. I'm not sure that
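
A minimal Scala sketch of the idea (leftDf, rightDf and the key column are illustrative):

    import org.apache.spark.HashPartitioner
    val part  = new HashPartitioner(100)
    val left  = leftDf.rdd.map(r => (r.getString(0), r)).partitionBy(part)
    val right = rightDf.rdd.map(r => (r.getString(0), r)).partitionBy(part)
    val joined = left.join(right)  // co-partitioned, so the join needs no extra shuffle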

Re: Spark return key value pair

2015-08-19 Thread Robin East
Dawid is right; if you did words.count it would be twice the number of input lines. You can use map like this: words = lines.map(mapper2) for i in words.take(10): msg = i[0] + ":" + i[1] + "\n" --- Robin East

Re: Issues with S3 paths that contain colons

2015-08-19 Thread Romi Kuntsman
I had the exact same issue, and overcame it by overriding NativeS3FileSystem with my own class, where I replaced the implementation of globStatus. It's a hack but it works. Then I set the hadoop config fs.myschema.impl to my class name, and accessed the files through myschema:// instead of s3n://
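
A heavily hedged sketch of the shape of such a hack; the class name and the trivial glob bypass are illustrative, not the poster's actual implementation:

    import org.apache.hadoop.fs.{FileStatus, Path}
    import org.apache.hadoop.fs.s3native.NativeS3FileSystem

    class ColonSafeS3FileSystem extends NativeS3FileSystem {
      // Treat the "pattern" as a literal key so ':' never reaches the glob parser
      override def globStatus(pathPattern: Path): Array[FileStatus] =
        Array(getFileStatus(pathPattern))
    }
    // hadoop conf: set fs.myschema.impl to this class's fully qualified name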

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
Hi Canan, The event log dir is a per-application setting whereas the history server is an independent service that serves history UIs from many applications. If you use history server locally then the `spark.history.fs.logDirectory` will happen to point to `spark.eventLog.dir`, but the use case
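
A typical pairing in conf/spark-defaults.conf, where the two properties happen to point at the same directory (the path is illustrative):

    spark.eventLog.enabled          true
    spark.eventLog.dir              hdfs:///shared/spark-logs
    spark.history.fs.logDirectory   hdfs:///shared/spark-logs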

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a bucket is a single file in hash-based shuffle (no consolidation), but a segment of a partitioned file in sort-based shuffle. 2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk: Thanks Andrew for a detailed response, So the reason why key value pairs

Re: issue Running Spark Job on Yarn Cluster

2015-08-19 Thread stark_summer
Please look further into the hadoop logs, e.g. yarn logs -applicationId xxx, and attach more logs to this topic.

How to set the number of executors and tasks in a Spark Streaming job in Mesos

2015-08-19 Thread swetha
Hi, How do I set the number of executors and tasks in a Spark Streaming job on Mesos? I have the following settings, but my job still shows me 11 active tasks and 11 executors. Any idea as to why this is happening? sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "128")

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
Thanks Andrew, makes sense. On Thu, Aug 20, 2015 at 4:40 AM, Andrew Or and...@databricks.com wrote: Hi Canan, The event log dir is a per-application setting whereas the history server is an independent service that serves history UIs from many applications. If you use history server locally

Re:Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
I think the YARN ResourceManager has a mechanism to relaunch the driver on failure, but I am uncertain. Could someone help with this? Thanks. At 2015-08-19 16:37:32, Spark Enthusiast sparkenthusi...@yahoo.in wrote: Thanks for the reply. Are Standalone or Mesos the only options? Is there a
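
For what it's worth, the relevant YARN knob is the ApplicationMaster retry bound; in yarn-cluster mode a driver crash triggers a new attempt up to this limit. A sketch for yarn-site.xml (the value is illustrative):

    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>4</value>
    </property>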

Spark return key value pair

2015-08-19 Thread Jerry OELoo
Hi. I want to parse a file and return a key-value pair with pySpark, but the result is strange to me. test.sql is a big file and each line is a username and password, with # between them. I use the mapper2 below to map the data, and in my understanding, i in words.take(10) should be a tuple, but the result is

Spark UI returning error 500 in yarn-client mode

2015-08-19 Thread mosheeshel
Running Spark on YARN in yarn-client mode, everything appears to be working just fine except I can't access Spark UI via the Application Master link in YARN UI or directly at http://driverip:4040/jobs I get error 500, and the driver log shows the error pasted below: When running the same job

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with
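
A sketch of flipping that flag from code, using the property name given above:

    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")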

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I cloned from TOT after the 1.5.0 cutoff. I noticed there were a couple of CLs trying to speed up Spark SQL with tables with a huge number of partitions; I've made sure that those CLs are included, but it's still very slow. On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote: Yes,

creating data warehouse with Spark and running query with Hive

2015-08-19 Thread Jeetendra Gangele
Hi All, I have data in an HDFS partition layout of Year/month/date/event_type, and I am creating Hive tables from this data. The data is in JSON, so I am using a JSON serde to create the Hive tables. Below is the code: val jsonFile =
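
A minimal sketch of the load side, assuming an illustrative path and a plain temp table rather than the poster's serde-based table:

    val jsonDF = sqlContext.read.json("hdfs:///events/2015/08/19/click/*.json")
    jsonDF.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()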

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Can you do some more profiling? I am wondering if the driver is busy scanning HDFS / S3, e.g. jstack &lt;pid of driver process&gt;. Also, it would be great if you could paste the physical plan for the simple query. From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August

Re: Spark Interview Questions

2015-08-19 Thread Sandeep Giri
Thank you All. I have updated it to a little better version. Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office) https://linkedin.com/company/knowbigdata

Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
Dear list, I am observing a very strange difference in behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin interpreter (compiled with Java 6 and sourced from maven central). The workflow loads data from Hive, applies a number of transformations

Re:How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
There is an option for spark-submit (Spark standalone or Mesos with cluster deploy mode only): --supervise. If given, it restarts the driver on failure. At 2015-08-19 14:55:39, Spark Enthusiast sparkenthusi...@yahoo.in wrote: Folks, As I see it, the Driver program is a
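
For example (the master URL, class and jar are placeholders):

    spark-submit --master spark://master:7077 \
      --deploy-mode cluster --supervise \
      --class com.example.MyDriver my-app.jar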

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-19 Thread UMESH CHAUDHARY
Just add spark_1.4.1_yarn_shuffle.jar to the classpath, or create a new Maven project using the dependencies below: &lt;dependency&gt; &lt;groupId&gt;org.apache.spark&lt;/groupId&gt; &lt;artifactId&gt;spark-core_2.11&lt;/artifactId&gt; &lt;version&gt;1.4.1&lt;/version&gt; &lt;/dependency&gt; &lt;dependency&gt; &lt;groupId&gt;org.apache.spark&lt;/groupId&gt; &lt;artifactId&gt;spark-sql_2.11&lt;/artifactId&gt;

SaveAsTable changes the order of rows

2015-08-19 Thread Kevin Jung
I have a simple RDD with key/value pairs and do: val partitioned = rdd.partitionBy(new HashPartitioner(400)) val row = partitioned.first I can get a key, G2726, from the returned row. This first row is located in partition #0 because G2726.hashCode is 67114000 and 67114000 % 400 is 0. But the order of

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-19 Thread Shenghua(Daniel) Wan
+1 I wish I had read this blog earlier. I am using Java and have just implemented a singleton producer per executor/JVM during the day. Yes, I did see that NonSerializableException when I was debugging the code ... Thanks for sharing. On Tue, Aug 18, 2015 at 10:59 PM, Tathagata Das
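
A hedged Scala sketch of the per-JVM singleton pattern being described (broker and topic names are illustrative):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerHolder {
      // created lazily, once per executor JVM, instead of being broadcast
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }
    // e.g. inside rdd.foreachPartition { it =>
    //   it.foreach(msg => ProducerHolder.producer.send(new ProducerRecord("topic", msg))) }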

Does spark sql support column indexing

2015-08-19 Thread Todd
I can't find any discussion on whether Spark SQL supports column indexing. If it does, is there a guide on how to do it? Thanks.

How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Spark Enthusiast
Folks, As I see it, the Driver program is a single point of failure. Now, I have seen ways to make it recover from failures on a restart (using checkpointing), but I have not seen anything on how to restart it automatically if it crashes. Will running the Driver as a Hadoop Yarn

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtdZv0d1btRHl/Spark+build+modulesubj=Building+Spark+Building+just+one+module+ On Aug 19, 2015, at 1:44 AM, canan chen ccn...@gmail.com wrote: I want to work on one jira, but it is not easy to do unit test, because it involves different

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R], but Option&lt;KafkaRDD[K, V, U, T, R]&gt; is not a subclass of Option&lt;RDD[R]&gt;. In Scala, C[T'] is a subclass of C[T] as per https://twitter.github.io/scala_school/type-basics.html but this is not allowed in Java. So is there any workaround
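
A self-contained illustration of the Scala side, with stand-in types (Animal/Dog play the roles of RDD[R]/KafkaRDD[...]):

    class Animal; class Dog extends Animal
    val d: Option[Dog] = Some(new Dog)
    val a: Option[Animal] = d  // compiles: Option is declared Option[+A] (covariant)
    // In Java, Option<Dog> is not a subtype of Option<Animal>; the usual
    // workaround is a wildcard such as Option<? extends Animal>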

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
Thanks a lot for your suggestion. I modified HADOOP_CONF_DIR in spark-env.sh so that core-site.xml is under HADOOP_CONF_DIR. I am able to see logs like those you showed above. Now I am able to run for 3 minutes and store results every minute. After some time, there is an

回复:Does spark sql support column indexing

2015-08-19 Thread prosp4300
The answer is simply NO. But I hope someone can give a deeper insight or any meaningful reference. On 2015-08-19 15:21, Todd wrote: I can't find any discussion on whether Spark SQL supports column indexing. If it does, is there a guide on how to do it? Thanks.

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
We are using Cloudera 5.3.1. Since it is one of the earlier versions of CDH, it doesn't support the latest version of Spark, so I installed spark-1.4.1 separately on my machine. I couldn't do spark-submit in cluster mode. How do I put core-site.xml under the classpath? It would be very helpful if you

What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
I want to work on one jira, but it is not easy to do unit testing, because it involves different components, especially the UI. The Spark build is pretty slow, and I don't want to build it each time to test my code change. I am wondering how other people do it? Is there any experience you can share? Thanks

Re: Programmatically create SparkContext on YARN

2015-08-19 Thread Andreas Fritzler
Hi Andrew, Thanks a lot for your response. I am aware of the '--master' flag in the spark-submit command. However I would like to create the SparkContext inside my coding. Maybe I should elaborate a little bit further: I would like to reuse e.g. the result of any Spark computation inside my

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Spark Enthusiast
Thanks for the reply. Are Standalone or Mesos the only options? Is there a way to auto-relaunch if the driver runs as a Hadoop YARN application? On Wednesday, 19 August 2015 12:49 PM, Todd bit1...@163.com wrote: There is an option for spark-submit (Spark standalone or Mesos with

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is: why does Spark have to do partition discovery over all partitions when the query only needs to look at one partition? Is there a conf flag to turn this off? On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com wrote: I've had the same problem. It turns

Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all, I did a simple experiment with Spark SQL. I created a partitioned parquet table with only one partition (date=20140701). A simple `select count(*) from table where date=20140701` would run very fast (0.1 seconds). However, as I added more partitions, the query took longer and longer. When

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Philip Weaver
I've had the same problem. It turns out that Spark (specifically parquet) is very slow at partition discovery. It got better in 1.5 (not yet released), but was still unacceptably slow. Sadly, we ended up reading parquet files manually in Python (via C++) and had to abandon Spark SQL because of