Yarn resource utilization with Spark pipe()

2016-11-23 Thread Sameer Choudhary
Hi, I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that uses RDD's pipe method to process my data. I start a lightweight daemon process that starts processes for each task via pipes. This is to ensure that I don't run into https://issues.apache.org/jira/browse/SPARK-671.
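For readers unfamiliar with the mechanism: RDD.pipe writes each partition's elements to an external command's stdin, one per line, and turns the command's stdout lines into the resulting RDD. A minimal sketch (the script path is hypothetical):

    // Each of the 3 partitions is streamed through its own instance of the
    // external command; its stdout lines become the output RDD.
    // "/usr/local/bin/process_records.sh" is a hypothetical script.
    val input = sc.parallelize(Seq("record1", "record2", "record3"), numSlices = 3)
    val processed = input.pipe("/usr/local/bin/process_records.sh")
    processed.collect().foreach(println)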

Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master: when running against 2.0.2 the pyspark job completes successfully, including running predictions.

Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors, all of the task outputs have a single line in both stderr and stdout similar to: Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/ Note: the directory /shared/sparkmaven/work exists and is owned by the same user running the

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread Dale Wang
The problem comes from the inconsistency between the graph's triplet view and vertex view. The message may not be lost; it is just never sent, because the sendMsg function gets the wrong value of srcAttr! It is not a new bug. I met a similar bug that appeared in version 1.2.1
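For reference, a skeleton of the Pregel call where such a mismatch would surface; the graph types and update logic are illustrative, not taken from the original report:

    import org.apache.spark.graphx._

    // sendMsg reads triplet.srcAttr from the triplet view; the reported bug
    // is that this value can lag behind what vprog produced in the previous
    // superstep. The max-propagation logic below is illustrative only.
    def run(graph: Graph[Int, Int]): Graph[Int, Int] =
      graph.pregel(initialMsg = 0, maxIterations = 10)(
        vprog = (id, attr, msg) => math.max(attr, msg),   // updates the vertex view
        sendMsg = triplet =>                              // reads the triplet view
          if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
          else Iterator.empty,
        mergeMsg = (a, b) => math.max(a, b)
      )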

Is there any API for categorical column statistics?

2016-11-23 Thread canan chen
Dataset.describe only calculates statistics for numerical columns, not for categorical ones. R's summary method can also calculate statistics for categorical data, which is very useful for exploratory data analysis. Just wondering, is there any API for categorical column statistics as well, or
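There is no describe() counterpart for categorical columns, but frequency tables are easy to build by hand; a sketch of the usual workaround (the DataFrame df and the column name "color" are hypothetical):

    import org.apache.spark.sql.functions._

    // Frequency table for one categorical column.
    df.groupBy("color").count().orderBy(desc("count")).show()

    // Number of distinct categories in the column.
    df.agg(countDistinct("color")).show()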

Re: spark.yarn.executor.memoryOverhead

2016-11-23 Thread Saisai Shao
From my understanding, this memory overhead should include "spark.memory.offHeap.size", which means the off-heap memory size should not be larger than the overhead memory size when running on YARN.

RE: How to expose Spark-Shell in the production?

2016-11-23 Thread Shreya Agarwal
Use Livy or a job server to execute spark-shell commands remotely.
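For the curious, Livy exposes spark-shell-like sessions over REST; a rough Scala sketch of creating a session and submitting a statement (host, port, and the hard-coded session id are assumptions; real code should parse the session id from the response and poll until the session is idle):

    import java.net.{HttpURLConnection, URL}

    object LivySketch {
      // Minimal JSON POST helper using only the JDK.
      def post(url: String, json: String): String = {
        val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out = conn.getOutputStream
        out.write(json.getBytes("UTF-8"))
        out.close()
        scala.io.Source.fromInputStream(conn.getInputStream).mkString
      }

      def main(args: Array[String]): Unit = {
        val livy = "http://livy-host:8998"   // hypothetical Livy endpoint
        // Create an interactive Spark (Scala) session...
        println(post(s"$livy/sessions", """{"kind": "spark"}"""))
        // ...then submit a line of code as if typed into spark-shell
        // (session id 0 assumed for brevity).
        println(post(s"$livy/sessions/0/statements",
          """{"code": "sc.parallelize(1 to 10).sum()"}"""))
      }
    }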

Re: Spark driver not reusing HConnection

2016-11-23 Thread Mukesh Jha
Corresponding HBase bug: https://issues.apache.org/jira/browse/HBASE-12629

Re: Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Sorry, please ignore this if you like. Looks like the network throughput is very low, but every worker/executor machine is indeed working. My current incoming network throughput on each worker machine is about 2.5 KB/s (kilobytes per second), so this needs to get somewhere into the 5-6 MB/s range, and that means

Re: Spark driver not reusing HConnection

2016-11-23 Thread Mukesh Jha
The solution is to disable the region size calculation check: hbase.regionsizecalculator.enable: false
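The same property can be set programmatically on the Configuration passed to the HBase input format; a sketch assuming the standard TableInputFormat path (the table name is hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Disable the region size calculator before building the HBase RDD.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")      // hypothetical table
    hbaseConf.set("hbase.regionsizecalculator.enable", "false")

    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])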

Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Hi All, Spark Shell doesn't seem to use Spark workers but Spark Submit does. I have the workers' IPs listed in the conf/slaves file. I am trying to count the number of rows in Cassandra using spark-shell, so I do the following on the Spark master: val df = spark.sql("SELECT test from hello") // This has
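A frequent cause of this symptom (a hedged guess, not confirmed in this thread): spark-shell falls back to a local master unless one is passed, whereas spark-submit launch scripts usually set --master explicitly. A quick check from inside the shell:

    // Which master is this session actually using? If no --master was
    // passed, this is typically "local[*]" and no standalone workers
    // are involved.
    spark.sparkContext.master

    // Launching the shell against the standalone cluster (shell command,
    // shown as a comment; the host name is hypothetical):
    //   spark-shell --master spark://master-host:7077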

Mapping KMeans trained data to respective records

2016-11-23 Thread Reth RM
I am using the wholeTextFiles API to load a bunch of text files, (caching this object) mapping their text content to TF-IDF vectors, and then applying k-means on these vectors. After training, the k-means model predicts the clusterId of the training data by taking the list of training data; the question is how to map
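One common pattern is to keep the record's key (e.g. the file path from wholeTextFiles) next to its vector and predict per record; a sketch using spark.mllib's KMeans with made-up paths and features:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Pair each file path with its TF-IDF vector (featurization elided;
    // paths and values are illustrative).
    val docVectors = sc.parallelize(Seq(
      ("file:/docs/a.txt", Vectors.dense(0.1, 0.9)),
      ("file:/docs/b.txt", Vectors.dense(0.8, 0.2))
    ))

    val model = KMeans.train(docVectors.values, k = 2, maxIterations = 20)

    // Predict per record, keeping the path, so each clusterId maps back
    // to the document it came from.
    val assignments = docVectors.mapValues(model.predict)   // (path, clusterId)
    assignments.collect().foreach(println)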

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-23 Thread Shixiong(Ryan) Zhu
Scala has not yet resolved this issue. Once they fix it and release a new version, you can just upgrade the Scala version yourself.

spark.yarn.executor.memoryOverhead

2016-11-23 Thread Koert Kuipers
In YarnAllocator I see that memoryOverhead is by default set to math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN). This does not take spark.memory.offHeap.size into account, I think. Should it? Something like: math.max((MEMORY_OVERHEAD_FACTOR * executorMemory +
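Spelled out with the constants as they appear in YarnAllocator (the off-heap term is this thread's proposal, not current behavior; note that the real spark.memory.offHeap.size is configured in bytes, expressed in MB below to keep the arithmetic simple):

    // Constants from YarnAllocator (Spark 1.6/2.0 era).
    val MEMORY_OVERHEAD_FACTOR = 0.10
    val MEMORY_OVERHEAD_MIN = 384

    val executorMemory = 8192   // executor heap in MB (example value)
    val offHeapSize = 2048      // off-heap size in MB (example value)

    // Current default: overhead ignores off-heap memory.
    val memoryOverhead = math.max(
      (MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)

    // The proposal in this thread: fold the off-heap size in.
    val proposedOverhead = math.max(
      (MEMORY_OVERHEAD_FACTOR * executorMemory + offHeapSize).toInt, MEMORY_OVERHEAD_MIN)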

spark sql jobs heap memory

2016-11-23 Thread Koert Kuipers
We are testing Dataset/DataFrame jobs instead of RDD jobs. One thing we keep running into is containers getting killed by YARN. I realize this has to do with off-heap memory, and the suggestion is to increase spark.yarn.executor.memoryOverhead. At times our memoryOverhead is as large as the
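For reference, the knob under discussion, with illustrative numbers only (a sketch, not a recommendation):

    // On the command line:
    //   spark-submit \
    //     --conf spark.executor.memory=8g \
    //     --conf spark.yarn.executor.memoryOverhead=2048 \
    //     ...
    // (value in MB; this is the Spark 1.x/2.x property name)

    // Or when building the session in yarn-client mode, before the
    // context starts:
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.yarn.executor.memoryOverhead", "2048")
      .getOrCreate()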

Re: how to see Pipeline model information

2016-11-23 Thread Xiaomeng Wan
You can use pipelinemodel.stages(0).asInstanceOf[RandomForestModel]. The number (0 in this example) depends on the order in which you call setStages. Shawn
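Fleshed out slightly; note that in spark.ml the fitted stage is a RandomForestClassificationModel (or RandomForestRegressionModel), and the stage index must match the order given to setStages:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // If the forest was the third stage passed to setStages, e.g.
    // pipeline.setStages(Array(indexer, assembler, rf)):
    val pipelineModel: PipelineModel = ???   // the fitted pipeline
    val rfModel = pipelineModel.stages(2).asInstanceOf[RandomForestClassificationModel]

    println(rfModel.toDebugString)   // full description of every tree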

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread rohit13k
Created a JIRA for the same: https://issues.apache.org/jira/browse/SPARK-18568

Re: GraphX Pregel not updating vertex state properly, causing message loss

2016-11-23 Thread rohit13k
Hi, I am facing a similar issue. It's not that the message is getting lost or something: vertex 1's attribute changes in superstep 1, but when sendMsg gets the vertex attribute from the edge triplet in the 2nd superstep, it still has the old value of vertex 1 and not the latest value. So

how to see Pipeline model information

2016-11-23 Thread Zhiliang Zhu
Dear All, I am building a model with a Spark pipeline, and in the pipeline I used the Random Forest algorithm as one of its stages. If I just use Random Forest directly, not by way of a pipeline, I can see information about the forest via APIs such as rfModel.toDebugString() and rfModel.toString(). However, while it

subtractByKey modifies values in the source RDD

2016-11-23 Thread Dmitry Dzhus
I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala 2.11.x. Relevant code:

    object Types {
      type ContentId = Int
      type ContentKey = Tuple2[Int, ContentId]
      type InternalContentId = Int
    }

    val inverseItemIDMap: RDD[(InternalContentId,
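For readers unfamiliar with the operator: subtractByKey keeps only the pairs whose key is absent from the other RDD, and, like any RDD transformation, it is expected to leave its inputs untouched; a minimal demonstration:

    // Keeps pairs from `left` whose key does NOT appear in `right`.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((2, ()), (4, ())))

    val remaining = left.subtractByKey(right)
    remaining.collect().foreach(println)   // (1,a) and (3,c)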

RE: CSV to parquet preserving partitioning

2016-11-23 Thread benoitdr
Best solution I've found so far (no shuffling, and as many threads as input dirs), as sketched below:
- Create an RDD of input dirs, with as many partitions as input dirs
- Transform it to an RDD of input files (preserving the partitioning by dir)
- Flat-map it with a custom CSV parser
- Convert the RDD to a dataframe
- Write
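A hedged sketch of those steps (paths and the CSV parser are hypothetical; local-filesystem listing is used for brevity, where an HDFS setup would go through the Hadoop FileSystem API):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // One directory per chunk of input data (hypothetical layout).
    val inputDirs = Seq("/data/in/day=01", "/data/in/day=02", "/data/in/day=03")

    // One Spark partition per input dir, so dirs are processed in
    // parallel without a shuffle.
    val dirsRdd = sc.parallelize(inputDirs, numSlices = inputDirs.size)

    // Hypothetical helpers standing in for the custom parser.
    def listFiles(dir: String): Seq[String] =
      new java.io.File(dir).listFiles().map(_.getPath).toSeq
    def parseCsvLine(line: String): (String, Int) = {
      val cols = line.split(',')
      (cols(0), cols(1).toInt)
    }

    val parsed = dirsRdd
      .flatMap(listFiles)                                   // files, still grouped by dir
      .flatMap(f => scala.io.Source.fromFile(f).getLines()) // raw lines
      .map(parseCsvLine)

    // RDD -> DataFrame -> parquet.
    parsed.toDF("key", "value").write.parquet("/data/out/parquet")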

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-23 Thread kant kodali
Hi Michael, Looks like all the from_json functions will require me to pass a schema, and that can be a little tricky for us, but the code below doesn't require me to pass a schema at all:

    import org.apache.spark.sql._
    val rdd = df2.rdd.map { case Row(j: String) => j }
    spark.read.json(rdd).show()
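Made self-contained for reference (the sample rows are made up); the reason no schema is needed is that spark.read.json on an RDD[String] infers it by scanning the data, at the cost of an extra pass over the input:

    import org.apache.spark.sql._
    import spark.implicits._   // `spark` is the SparkSession, as in spark-shell

    // Stand-in for df2: a single-column DataFrame of raw JSON strings.
    val df2: DataFrame = Seq("""{"a": 1, "b": "x"}""", """{"a": 2, "b": "y"}""").toDF("json")

    // Extract the strings and let Spark infer the schema from the data.
    val rdd = df2.rdd.map { case Row(j: String) => j }
    spark.read.json(rdd).show()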