Yarn resource utilization with Spark pipe()

2016-11-23 Thread Sameer Choudhary
Hi, I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that uses RDD's pipe method to process my data. I start a lightweight daemon process that starts processes for each task via pipes. This is to ensure that I don't run into https://issues.apache.org/jira/browse/SPARK-671.
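For readers unfamiliar with the mechanism: RDD.pipe writes each partition's elements to an external command's stdin, one per line, and turns the command's stdout lines into the resulting RDD. A minimal sketch (the script path is hypothetical):

    // Each of the 3 partitions is streamed through its own instance of the
    // external command; its stdout lines become the output RDD.
    // "/usr/local/bin/process_records.sh" is a hypothetical script.
    val input = sc.parallelize(Seq("record1", "record2", "record3"), numSlices = 3)
    val processed = input.pipe("/usr/local/bin/process_records.sh")
    processed.collect().foreach(println)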

Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master: when running against 2.0.2 the pyspark job completes successfully, including running predictions.

Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors, all of the task outputs have a single line in both stderr and stdout similar to: Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/ Note: the directory /shared/sparkmaven/work exists and is owned by the same user running the

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread Dale Wang
The problem comes from the inconsistency between the graph's triplet view and vertex view. The message may not be lost; it is just never sent, because the sendMsg function gets the wrong value of srcAttr! It is not a new bug. I met a similar bug that appeared in version 1.2.1
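For reference, a skeleton of the Pregel call where such a mismatch would surface; the graph types and update logic are illustrative, not taken from the original report:

    import org.apache.spark.graphx._

    // sendMsg reads triplet.srcAttr from the triplet view; the reported bug
    // is that this value can lag behind what vprog produced in the previous
    // superstep. The max-propagation logic below is illustrative only.
    def run(graph: Graph[Int, Int]): Graph[Int, Int] =
      graph.pregel(initialMsg = 0, maxIterations = 10)(
        vprog = (id, attr, msg) => math.max(attr, msg),   // updates the vertex view
        sendMsg = triplet =>                              // reads the triplet view
          if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
          else Iterator.empty,
        mergeMsg = (a, b) => math.max(a, b)
      )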

Is there any API for categorical column statistics?

2016-11-23 Thread canan chen
Dataset.describe only calculates statistics for numerical columns, not for categorical ones. R's summary method can also calculate statistics for categorical data, which is very useful for exploratory data analysis. Just wondering, is there any API for categorical column statistics as well, or
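There is no describe() counterpart for categorical columns, but frequency tables are easy to build by hand; a sketch of the usual workaround (the DataFrame df and the column name "color" are hypothetical):

    import org.apache.spark.sql.functions._

    // Frequency table for one categorical column.
    df.groupBy("color").count().orderBy(desc("count")).show()

    // Number of distinct categories in the column.
    df.agg(countDistinct("color")).show()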

Re: spark.yarn.executor.memoryOverhead

2016-11-23 Thread Saisai Shao
From my understanding, this memory overhead should include "spark.memory.offHeap.size", which means the off-heap memory size should not be larger than the overhead memory size when running on YARN.

RE: How to expose Spark-Shell in the production?

2016-11-23 Thread Shreya Agarwal
Use Livy or a job server to execute spark-shell commands remotely.
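For the curious, Livy exposes spark-shell-like sessions over REST; a rough Scala sketch of creating a session and submitting a statement (host, port, and the hard-coded session id are assumptions; real code should parse the session id from the response and poll until the session is idle):

    import java.net.{HttpURLConnection, URL}

    object LivySketch {
      // Minimal JSON POST helper using only the JDK.
      def post(url: String, json: String): String = {
        val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out = conn.getOutputStream
        out.write(json.getBytes("UTF-8"))
        out.close()
        scala.io.Source.fromInputStream(conn.getInputStream).mkString
      }

      def main(args: Array[String]): Unit = {
        val livy = "http://livy-host:8998"   // hypothetical Livy endpoint
        // Create an interactive Spark (Scala) session...
        println(post(s"$livy/sessions", """{"kind": "spark"}"""))
        // ...then submit a line of code as if typed into spark-shell
        // (session id 0 assumed for brevity).
        println(post(s"$livy/sessions/0/statements",
          """{"code": "sc.parallelize(1 to 10).sum()"}"""))
      }
    }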

Re: Spark driver not reusing HConnection

2016-11-23 Thread Mukesh Jha
Corresponding HBase bug: https://issues.apache.org/jira/browse/HBASE-12629

Re: Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Sorry, please ignore this if you like. Looks like the network throughput is very low, but every worker/executor machine is indeed working. My current incoming network throughput on each worker machine is about 2.5 KB/s (kilobytes per second), so this needs to get somewhere into the 5-6 MB/s range, and that means

Re: Spark driver not reusing HConnection

2016-11-23 Thread Mukesh Jha
The solution is to disable the region size calculation check: hbase.regionsizecalculator.enable: false
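The same property can be set programmatically on the Configuration passed to the HBase input format; a sketch assuming the standard TableInputFormat path (the table name is hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Disable the region size calculator before building the HBase RDD.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")      // hypothetical table
    hbaseConf.set("hbase.regionsizecalculator.enable", "false")

    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])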

Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Hi All, Spark Shell doesn't seem to use Spark workers but Spark Submit does. I have the workers' IPs listed in the conf/slaves file. I am trying to count the number of rows in Cassandra using spark-shell, so I do the following on the Spark master: val df = spark.sql("SELECT test from hello") // This has
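A frequent cause of this symptom (a hedged guess, not confirmed in this thread): spark-shell falls back to a local master unless one is passed, whereas spark-submit launch scripts usually set --master explicitly. A quick check from inside the shell:

    // Which master is this session actually using? If no --master was
    // passed, this is typically "local[*]" and no standalone workers
    // are involved.
    spark.sparkContext.master

    // Launching the shell against the standalone cluster (shell command,
    // shown as a comment; the host name is hypothetical):
    //   spark-shell --master spark://master-host:7077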

Mapping KMeans trained data to respective records

2016-11-23 Thread Reth RM
I am using the wholeTextFiles API to load a bunch of text files, (caching this object) mapping their text content to TF-IDF vectors, and then applying k-means on these vectors. After training, the k-means model predicts the clusterId of the training data by taking the list of training data; the question is how to map
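One common pattern is to keep the record's key (e.g. the file path from wholeTextFiles) next to its vector and predict per record; a sketch using spark.mllib's KMeans with made-up paths and features:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Pair each file path with its TF-IDF vector (featurization elided;
    // paths and values are illustrative).
    val docVectors = sc.parallelize(Seq(
      ("file:/docs/a.txt", Vectors.dense(0.1, 0.9)),
      ("file:/docs/b.txt", Vectors.dense(0.8, 0.2))
    ))

    val model = KMeans.train(docVectors.values, k = 2, maxIterations = 20)

    // Predict per record, keeping the path, so each clusterId maps back
    // to the document it came from.
    val assignments = docVectors.mapValues(model.predict)   // (path, clusterId)
    assignments.collect().foreach(println)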

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-23 Thread Shixiong(Ryan) Zhu
Scala has not yet resolved this issue. Once they fix it and release a new version, you can just upgrade the Scala version yourself.

spark.yarn.executor.memoryOverhead

2016-11-23 Thread Koert Kuipers
In YarnAllocator I see that memoryOverhead is by default set to math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN). This does not take spark.memory.offHeap.size into account, I think. Should it? Something like: math.max((MEMORY_OVERHEAD_FACTOR * executorMemory +
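Spelled out with the constants as they appear in YarnAllocator (the off-heap term is this thread's proposal, not current behavior; note that the real spark.memory.offHeap.size is configured in bytes, expressed in MB below to keep the arithmetic simple):

    // Constants from YarnAllocator (Spark 1.6/2.0 era).
    val MEMORY_OVERHEAD_FACTOR = 0.10
    val MEMORY_OVERHEAD_MIN = 384

    val executorMemory = 8192   // executor heap in MB (example value)
    val offHeapSize = 2048      // off-heap size in MB (example value)

    // Current default: overhead ignores off-heap memory.
    val memoryOverhead = math.max(
      (MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)

    // The proposal in this thread: fold the off-heap size in.
    val proposedOverhead = math.max(
      (MEMORY_OVERHEAD_FACTOR * executorMemory + offHeapSize).toInt, MEMORY_OVERHEAD_MIN)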

spark sql jobs heap memory

2016-11-23 Thread Koert Kuipers
We are testing Dataset/DataFrame jobs instead of RDD jobs. One thing we keep running into is containers getting killed by YARN. I realize this has to do with off-heap memory, and the suggestion is to increase spark.yarn.executor.memoryOverhead. At times our memoryOverhead is as large as the
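For reference, the knob under discussion, with illustrative numbers only (a sketch, not a recommendation):

    // On the command line:
    //   spark-submit \
    //     --conf spark.executor.memory=8g \
    //     --conf spark.yarn.executor.memoryOverhead=2048 \
    //     ...
    // (value in MB; this is the Spark 1.x/2.x property name)

    // Or when building the session in yarn-client mode, before the
    // context starts:
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.yarn.executor.memoryOverhead", "2048")
      .getOrCreate()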

Re: how to see Pipeline model information

2016-11-23 Thread Xiaomeng Wan
You can use pipelinemodel.stages(0).asInstanceOf[RandomForestModel]. The number (0 in this example) depends on the order in which you call setStages. Shawn
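Fleshed out slightly; note that in spark.ml the fitted stage is a RandomForestClassificationModel (or RandomForestRegressionModel), and the stage index must match the order given to setStages:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // If the forest was the third stage passed to setStages, e.g.
    // pipeline.setStages(Array(indexer, assembler, rf)):
    val pipelineModel: PipelineModel = ???   // the fitted pipeline
    val rfModel = pipelineModel.stages(2).asInstanceOf[RandomForestClassificationModel]

    println(rfModel.toDebugString)   // full description of every tree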

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread rohit13k
Created a JIRA for the same: https://issues.apache.org/jira/browse/SPARK-18568

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread rohit13k
Hi, I am facing a similar issue. It's not that the message is getting lost or something: vertex 1's attribute changes in superstep 1, but when sendMsg gets the vertex attribute from the edge triplet in the 2nd superstep, it still has the old value of vertex 1 and not the latest value. So

how to see Pipeline model information

2016-11-23 Thread Zhiliang Zhu
Dear All, I am building a model with a Spark pipeline, and in the pipeline I used the Random Forest algorithm as one of its stages. If I just use Random Forest directly, not by way of a pipeline, I can see information about the forest via APIs such as rfModel.toDebugString() and rfModel.toString(). However, while it

subtractByKey modifies values in the source RDD

2016-11-23 Thread Dmitry Dzhus
I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala 2.11.x. Relevant code:

    object Types {
      type ContentId = Int
      type ContentKey = Tuple2[Int, ContentId]
      type InternalContentId = Int
    }

    val inverseItemIDMap: RDD[(InternalContentId,
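For readers unfamiliar with the operator: subtractByKey keeps only the pairs whose key is absent from the other RDD, and, like any RDD transformation, it is expected to leave its inputs untouched; a minimal demonstration:

    // Keeps pairs from `left` whose key does NOT appear in `right`.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((2, ()), (4, ())))

    val remaining = left.subtractByKey(right)
    remaining.collect().foreach(println)   // (1,a) and (3,c)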

RE: CSV to parquet preserving partitioning

2016-11-23 Thread benoitdr
Best solution I've found so far (no shuffling, and as many threads as input dirs), as sketched below:
- Create an RDD of input dirs, with as many partitions as input dirs
- Transform it to an RDD of input files (preserving the partitioning by dir)
- Flat-map it with a custom CSV parser
- Convert the RDD to a dataframe
- Write
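A hedged sketch of those steps (paths and the CSV parser are hypothetical; local-filesystem listing is used for brevity, where an HDFS setup would go through the Hadoop FileSystem API):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // One directory per chunk of input data (hypothetical layout).
    val inputDirs = Seq("/data/in/day=01", "/data/in/day=02", "/data/in/day=03")

    // One Spark partition per input dir, so dirs are processed in
    // parallel without a shuffle.
    val dirsRdd = sc.parallelize(inputDirs, numSlices = inputDirs.size)

    // Hypothetical helpers standing in for the custom parser.
    def listFiles(dir: String): Seq[String] =
      new java.io.File(dir).listFiles().map(_.getPath).toSeq
    def parseCsvLine(line: String): (String, Int) = {
      val cols = line.split(',')
      (cols(0), cols(1).toInt)
    }

    val parsed = dirsRdd
      .flatMap(listFiles)                                   // files, still grouped by dir
      .flatMap(f => scala.io.Source.fromFile(f).getLines()) // raw lines
      .map(parseCsvLine)

    // RDD -> DataFrame -> parquet.
    parsed.toDF("key", "value").write.parquet("/data/out/parquet")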

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-23 Thread kant kodali
Hi Michael, Looks like all the from_json functions will require me to pass a schema, and that can be a little tricky for us, but the code below doesn't require me to pass a schema at all:

    import org.apache.spark.sql._
    val rdd = df2.rdd.map { case Row(j: String) => j }
    spark.read.json(rdd).show()
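Made self-contained for reference (the sample rows are made up); the reason no schema is needed is that spark.read.json on an RDD[String] infers it by scanning the data, at the cost of an extra pass over the input:

    import org.apache.spark.sql._
    import spark.implicits._   // `spark` is the SparkSession, as in spark-shell

    // Stand-in for df2: a single-column DataFrame of raw JSON strings.
    val df2: DataFrame = Seq("""{"a": 1, "b": "x"}""", """{"a": 2, "b": "y"}""").toDF("json")

    // Extract the strings and let Spark infer the schema from the data.
    val rdd = df2.rdd.map { case Row(j: String) => j }
    spark.read.json(rdd).show()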