Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi Dhruve, thanks. I've solved the issue by adding max executors. I wanted to find some place where I can add this behavior in Spark so that the user does not have to worry about the max executors. Cheers - Thanks, via mobile, excuse brevity. On Sep 24, 2016 1:15 PM, "dhruve ashar"

Re: Tuning Spark memory

2016-09-23 Thread Takeshi Yamamuro
Hi, Currently, the memory fraction of shuffle and storage is automatically tuned by a memory manager, so you do not need to care about the parameter in most cases. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L24 // maropu On
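
For reference, a minimal sketch of the unified-memory settings involved (assuming Spark 2.0; the values shown are the documented defaults, not tuning advice):
{code}
// With the unified memory manager, execution and storage share one region;
// only these two knobs normally matter. spark.shuffle.memoryFraction and
// spark.storage.memoryFraction apply only in legacy mode.
val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.6")        // share of heap for execution + storage
  .set("spark.memory.storageFraction", "0.5") // part of that protected from eviction
{code}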

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Is there anywhere I can help fix this? I can see the requests being made in the YARN allocator. What should be the upper limit of the requests made? https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222 On Sat, Sep 24, 2016 at

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Have been playing around with configs to crack this. Adding them here in case they are helpful to others :) The number of executors and the timeout seemed like the core issue. {code} --driver-memory 4G \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.maxExecutors=500 \
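
For readers finding this later, a hedged sketch of that kind of submit command (the quoted {code} block is cut off above; the flags are standard spark-submit options, but the values, the class/jar names, and the spark.network.timeout line are illustrative, not necessarily what the original post used):
{code}
spark-submit \
  --driver-memory 4G \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=500 \
  --conf spark.network.timeout=600s \
  --class com.example.MyApp my-app.jar
{code}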

Error in run multiple unit test that extends DataFrameSuiteBase

2016-09-23 Thread Jinyuan Zhou
After I created two test case that FlatSpec with DataFrameSuiteBase. But I got errors when do sbt test. I was able to run each of them separately. My test cases does use sqlContext to read files. Here is the exception stack. Judging from the exception, I may need to unregister RpcEndpoint after

Re: With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Yong Zhang
df.write.format(source).mode("overwrite").save(path) Yong From: Dan Bikle Sent: Friday, September 23, 2016 6:45 PM To: user@spark.apache.org Subject: With spark DataFrame, how to write to existing folder? spark-world, I am walking through
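
Spelled out against the example from the original question, a sketch assuming the spark-csv source (the output path is a placeholder):
{code}
// mode("overwrite") replaces the existing folder instead of failing on it
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .save("newcars.csv")
{code}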

With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Dan Bikle
spark-world, I am walking through the example here: https://github.com/databricks/spark-csv#scala-api The example complains if I try to write a DataFrame to an existing folder: val selectedData = df.select("year", "model"); selectedData.write .format("com.databricks.spark.csv")

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ayan guha
You may try copying the file to the same location on all nodes and reading it from that place. On 24 Sep 2016 00:20, "ABHISHEK" wrote: > I have tried with hdfs/tmp location but it didn't work. Same error. > > On 23 Sep 2016 19:37, "Aditya"

Running Spark master/slave instances in non Daemon mode

2016-09-23 Thread Jeff Puro
Hi, I recently tried deploying Spark master and slave instances to container based environments such as Docker, Nomad etc. There are two issues that I've found with how the startup scripts work. The sbin/start-master.sh and sbin/start-slave.sh start a daemon by default, but this isn't as
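
One commonly used workaround (a sketch, not from the original post): launch the master and worker classes in the foreground through bin/spark-class, which suits container entrypoints; the host name and ports below are placeholders.
{code}
# master, foreground
./bin/spark-class org.apache.spark.deploy.master.Master \
  --host master.local --port 7077 --webui-port 8080

# worker, foreground
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.local:7077
{code}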

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mark Hamstra
> > The best network results are achieved when Spark nodes share the same > hosts as Hadoop or they happen to be on the same subnet. > That's only true for those portions of a Spark execution pipeline that are actually reading from HDFS. If you're re-using an RDD for which the needed shuffle

Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
You can run Spark code using the command line or by creating a JAR file (via IntelliJ or other IDE); however, you may wish to try a Databricks Community Edition account instead. They offer Spark as a managed service, and you can run Spark commands one at a time via interactive notebooks. There are

Re: databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Holden Karau
So the good news is the csv library has been integrated into Spark 2.0, so you don't need to use that package. On the other hand, if you're on an older version you can include it using the standard sbt or maven package configuration. On Friday, September 23, 2016, Dan Bikle
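
A short sketch of both routes (the file path and the package version are illustrative):
{code}
// Spark 2.0: the csv source is built in, no extra package needed
val df = spark.read.option("header", "true").csv("cars.csv")

// Spark 1.x: pull in the external package, e.g. in build.sbt
// libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
{code}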

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mich Talebzadeh
Does this assume that Spark is running on the same hosts as HDFS? Hence, does increasing the latency affect the network latency on the Hadoop nodes as well in your tests? The best network results are achieved when Spark nodes share the same hosts as Hadoop or they happen to be on the same subnet.

databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Dan Bikle
hello world-of-spark, I am learning spark today. I want to understand the spark code in this repo: https://github.com/databricks/spark-csv In the README.md I see this info: Linking You can link against this library in your program at the following coordinates: Scala 2.10 groupId:

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
See the reference on shuffles, "Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors

Re: Open source Spark based projects

2016-09-23 Thread manasdebashiskar
Check out Spark Packages https://spark-packages.org/ and you will find a few awesome and a lot of super awesome projects. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778p27788.html Sent from the Apache Spark User List

Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Dan Bikle
hello spark-world, I am new to spark and want to learn how to use it. I come from the Python world. I see an example at the url below: http://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param What would be an optimal way to run the above example? In the

Can somebody remove this guy?

2016-09-23 Thread Dirceu Semighini Filho
Can somebody remove this guy from the list: tod...@yahoo-inc.com? I just sent a message to the list and received an email from Yahoo saying that this address doesn't exist anymore. This is an automatically generated message. tod...@yahoo-inc.com is no longer with Yahoo! Inc. Your message will not be

Re: Re: Re: it does not stop at breakpoints which are in an anonymous function

2016-09-23 Thread Dirceu Semighini Filho
Hi Felix, I just ran your code and it prints Pi is roughly 4.0. Here is the code that I used; as you didn't show what `random` is, I used nextInt(). val n = math.min(10L * slices, Int.MaxValue).toInt // avoid overflow val count = context.sparkContext.parallelize(1 until n, slices).map
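
For completeness, a self-contained sketch of the estimate being discussed (the quoted code is truncated; this version uses math.random for the sample points, since the original `random` was not shown, and assumes `spark` is an existing SparkSession):
{code}
val slices = 2
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / (n - 1)}")
{code}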

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread sagarcasual .
Hi, thanks for the response. The issue I am facing is only for the clustered Kafka 2.11-based version 0.10.0.1 and Spark 1.6.1 with the following dependencies. org.apache.spark:spark-core_2.10:1.6.1 compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version:'1.6.1' compile group:

Spark MLlib ALS algorithm

2016-09-23 Thread Roshani Nagmote
Hello, I was working on the Spark MLlib ALS matrix factorization algorithm and came across the following blog post: https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html Can anyone help me understand what the "s" scaling factor does and whether it really gives

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
I have tried with the hdfs /tmp location but it didn't work. Same error. On 23 Sep 2016 19:37, "Aditya" wrote: > Hi Abhishek, > > Try below spark submit. > spark-submit --master yarn --deploy-mode cluster --files hdfs://abc.com:8020/tmp/abc.drl --class

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya
Hi Abhishek, Try below spark submit. spark-submit --master yarn --deploy-mode cluster --files hdfs://abc.com:8020/tmp/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl On Friday 23

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread Cody Koeninger
For Spark 2.0 there are two kafka artifacts, spark-streaming-kafka-0-10 (0.10 and later brokers only) and spark-streaming-kafka-0-8 (should work with 0.8 and later brokers). The docs explaining this were merged to master just after 2.0 was released, so they haven't been published yet. There are usage
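
In sbt terms, the two artifacts look like this (a sketch; Scala 2.11 and version 2.0.0 are assumptions):
{code}
// brokers 0.10 and later
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"

// brokers 0.8 and later
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
{code}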

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Thanks for your response Aditya and Steve. Steve: I have tried specifying both /tmp/filename in hdfs and a local path but it didn't work. You may be right that the Kie session is configured to access files from a local path. I have attached the code here for your reference and if you find something wrong,

ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.

2016-09-23 Thread muhammet pakyürek
I tried to connect to Cassandra via spark-cassandra-connector 2.0.0 on pyspark but I get the error below. I think it's related to pyspark/context.py but I don't know how.

Tuning Spark memory

2016-09-23 Thread tan shai
Hi, I am working with Spark 2.0; the job starts by sorting the input data and storing the output on HDFS. I am getting out-of-memory errors; the solution was to increase the value of spark.shuffle.memoryFraction from 0.2 to 0.8 and this solves the problem. But in the documentation I have found

Re: Apache Spark JavaRDD pipe() need help

2016-09-23 Thread शशिकांत कुलकर्णी
Thank you Jakob. I will try as suggested. Regards, Shashi On Fri, Sep 23, 2016 at 12:14 AM, Jakob Odersky wrote: > Hi Shashikant, > > I think you are trying to do too much at once in your helper class. > Spark's RDD API is functional; it is meant to be used by writing many >
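
For readers landing here for pipe() itself, a minimal sketch of the call (sc is an assumed SparkContext and "cat" is just a placeholder command):
{code}
// Each partition's elements go to the command's stdin, one per line;
// the command's stdout lines come back as a new RDD[String].
val piped = sc.parallelize(Seq("a", "b", "c")).pipe("cat")
piped.collect().foreach(println)
{code}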

UDAF collect_list: Hive Query or spark sql expression

2016-09-23 Thread Jason Mop
Hi Spark Team, I see most Hive functions have been implemented as Spark SQL expressions, but collect_list is still using a Hive query. Will it also be implemented as an expression in the future? Any update? Cheers, Ming
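
For context, a minimal usage sketch of the function in question (df and the column names are placeholders; collect_list has been exposed through org.apache.spark.sql.functions since 1.6):
{code}
import org.apache.spark.sql.functions.collect_list

// group the values into an array per key
df.groupBy("key").agg(collect_list("value").as("values")).show()
{code}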

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Steve Loughran
On 23 Sep 2016, at 08:33, ABHISHEK wrote: at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhietc/abc.drl (No such file or directory)

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya
Hi Abhishek, From your spark-submit it seems you're passing the file as a parameter to the driver program, so it depends on what exactly you are doing with that parameter. Using the --files option the file will be available to all the worker nodes, but if in your code you are referencing it using the
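
A sketch of the executor-side lookup that usually goes with --files (a hedged example, not the thread's attached code; "abc.drl" follows the thread's file name, and the exact behavior depends on deploy mode):
{code}
import org.apache.spark.SparkFiles

// Files shipped with --files are localized into each container's working
// directory; resolve the local copy by bare name, not the original hdfs:// URI.
val localDrl = SparkFiles.get("abc.drl")
// pass localDrl (a plain local path) to the Kie session / file reader
{code}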

Re: How to specify file

2016-09-23 Thread Mich Talebzadeh
You can do the following with option("delimiter"): val df = spark.read.option("header", false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv") HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Hello there, I have a Spark application which refers to an external file ‘abc.drl’ containing unstructured data. The application is able to find this reference file if I run the app in local mode, but in YARN cluster mode it is not able to find the file in the specified path. I tried with both local

Re: How to specify file

2016-09-23 Thread Sea
Hi Hemant, Aditya: I don't want to create a temp table and write code, I just want to run SQL directly on files: "select * from csv.`/path/to/file`" -- From: "Hemant Bhanawat"; Date: 23 Sep 2016 3:32

Re: How to specify file

2016-09-23 Thread Aditya
Hi Sea, To use Spark SQL you will need to create a DataFrame from the file and then execute select * on the DataFrame. In your case you will need to do something like this: JavaRDD<String> DF = context.textFile("path"); JavaRDD<Row> rowRDD3 = DF.map(new Function<String, Row>() {
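
A compact sketch of that suggestion using the Spark 2.0 built-in csv reader instead of the hand-rolled mapping above (`spark` is an assumed SparkSession; the '\u0001' delimiter and the path are placeholders from the thread):
{code}
val df = spark.read
  .option("delimiter", "\u0001")
  .csv("/path/to/file")

df.createOrReplaceTempView("my_table")
spark.sql("select * from my_table").show()
{code}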

Re: How to specify file

2016-09-23 Thread Hemant Bhanawat
Check out the README on the following page. This is the csv connector that you are using. I think you need to specify the delimiter option. https://github.com/databricks/spark-csv Hemant Bhanawat www.snappydata.io On Fri, Sep 23, 2016 at

How to specify file

2016-09-23 Thread Sea
Hi, I want to run SQL directly on files. I find that Spark supports SQL like select * from csv.`/path/to/file`, but files may not be split by ','. Maybe it is split by '\001'; how can I specify the delimiter? Thank you!

Re: Spark RDD and Memory

2016-09-23 Thread Aditya
Hi Datta, Thanks for the reply. If I haven't cached any RDD and the data being loaded into memory after performing some operations exceeds the memory, how is it handled by Spark? Are previously loaded RDDs removed from memory to free it for subsequent steps in the DAG? I am running

Re: Spark RDD and Memory

2016-09-23 Thread Datta Khot
Hi Aditya, If you cache the RDDs - like textFile.cache(), textFile1().cache() - then it will not load the data again from the file system. Once done with the related operations it is recommended to uncache the RDDs to manage memory efficiently and avoid its exhaustion. Note the caching operation is with
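
A minimal sketch of that pattern (sc is an assumed SparkContext; the path is a placeholder):
{code}
val textFile = sc.textFile("/data/input")
textFile.cache()                 // reuse across multiple actions without re-reading the file
val nonEmpty = textFile.filter(_.nonEmpty).count()
val words = textFile.flatMap(_.split("\\s+")).count()
textFile.unpersist()             // release the cached blocks when done
{code}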

Re: Redshift Vs Spark SQL (Thrift)

2016-09-23 Thread ayan guha
Thanks, but here is my argument that they may not serve different purposes: I am thinking of both Redshift and Hive as data warehousing solutions, with STS as a mechanism to lift Hive's performance (if Tez or LLAP can provide similar performance, I am fine to use Hive Thrift Server as well).

Re: Redshift Vs Spark SQL (Thrift)

2016-09-23 Thread Jörn Franke
It depends on what your use case is. A generic benchmark does not make sense, because they are different technologies for different purposes. > On 23 Sep 2016, at 06:09, ayan guha wrote: > > Hi > > Is there any benchmark or point of view in terms of pros and cons between AWS >