Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-02 Thread Ankur Dave
How large is your graph, and how much memory does your cluster have? We don't have a good way to determine the *optimal* number of partitions aside from trial and error, but to get the job to at least run to completion, it might help to use the MEMORY_AND_DISK storage level and a large number of
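
A minimal sketch of what that advice looks like in code, assuming Spark 1.1+, where GraphLoader.edgeListFile accepts edge/vertex storage levels; the path and the partition count are placeholders to be tuned by trial and error:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    // Load the edge list with many partitions and spill-to-disk storage levels
    // so the graph does not have to fit entirely in memory.
    val graph = GraphLoader.edgeListFile(
      sc,
      "hdfs:///path/to/edges",         // hypothetical edge-list path
      false,                           // canonicalOrientation
      800,                             // number of edge partitions (tune by trial and error)
      StorageLevel.MEMORY_AND_DISK,    // edge storage level
      StorageLevel.MEMORY_AND_DISK)    // vertex storage level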

Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-02 Thread James
Hello, We have a graph with 100B edges, nearly 800GB in gz format. We have 80 machines, each with 60GB of memory. I have never seen the program run to completion. Alcaid 2014-11-02 14:06 GMT+08:00 Ankur Dave ankurd...@gmail.com: How large is your graph, and how much memory does your

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stack trace, it is trying to build a single string that is too large for the VM to allocate memory for. The VM is not actually running out of memory; rather, the JVM cannot
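
For illustration, a minimal sketch (RDD names are hypothetical) of the pattern that hits this: after groupByKey each record is a (key, Iterable) pair, and saveAsTextFile renders the whole pair as one line of text, so a heavy key becomes a single enormous string:

    // pairs: RDD[(K, V)], hypothetical
    val grouped = pairs.groupByKey()
    // Each (key, Iterable[V]) is rendered via toString as ONE line, which for a
    // key with millions of values means one allocation the JVM cannot satisfy.
    grouped.saveAsTextFile("hdfs:///out")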

Re: Submiting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet:servlet-api" from my Hadoop dependencies to run a SparkContext. In your build.sbt: "org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"), "org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")

Spark on Yarn probably trying to load all the data to RAM

2014-11-02 Thread jan.zikes
Hi, I am using Spark on Yarn, specifically Spark in Python. I am trying to run: myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json") myrdd.getNumPartitions() Unfortunately it seems that Spark tries to load everything into RAM, or at least after a while of running this everything slows down and

Re: Spark speed performance

2014-11-02 Thread jan.zikes
Thank you, I would expect it to work as you write, but it seems to be working the other way for me. Right now it looks like Spark is generally trying to fit everything into RAM. I run Spark on YARN and I have wrapped this up into another question:

Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
Did you create a SQLContext? On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary abhinav.chowd...@gmail.com wrote: I have the same requirement of passing a list of values to an IN clause; when I try to do so I get the error below: scala> val longList = Seq[Expression](a, b) console:11: error: type
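
As a concrete fallback while the Scala DSL route gets sorted out, a minimal sketch (table, columns, and data are hypothetical) that expresses the membership test as a plain SQL IN clause against a registered table:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD        // implicit RDD -> SchemaRDD conversion (Spark 1.1)

    case class Person(name: String, city: String)
    val people = sc.parallelize(Seq(Person("a", "x"), Person("b", "y")))
    people.registerTempTable("people")

    // Find elements where a field is in a given set of values
    val hits = sqlContext.sql("SELECT * FROM people WHERE city IN ('x', 'z')")
    hits.collect().foreach(println)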

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Bharath Ravi Kumar
Thanks for responding. This is what I initially suspected, and hence asked why the library needed to construct the entire value buffer on a single host before writing it out. The stacktrace appeared to suggest that user code is not constructing the large buffer. I'm simply calling groupBy and

Re: ExecutorLostFailure (executor lost)

2014-11-02 Thread Akhil Das
You can check the worker logs for more accurate information (they are found under the work directory inside the Spark directory). I used to hit this issue with: - Too many open files: increasing the ulimit would solve this issue - Akka connection timeout/frame size: setting the following while
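
A minimal sketch of the Akka part of that list, set on the SparkConf before creating the context (the numbers are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "100")   // MB; raise if task results exceed the default frame size
      .set("spark.akka.timeout", "200")     // seconds; raise if executors are dropped on slow networks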

Re: Cannot instantiate hive context

2014-11-02 Thread Akhil Das
Adding the libthrift jar http://mvnrepository.com/artifact/org.apache.thrift/libthrift/0.9.0 to the classpath would resolve this issue. Thanks Best Regards On Sat, Nov 1, 2014 at 12:34 AM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am trying to load hive datasets using
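
For an sbt build, that boils down to a dependency line like the one below (the version matches the link above); passing the downloaded jar to spark-submit via --jars works as well:

    // build.sbt: put libthrift on the application classpath
    libraryDependencies += "org.apache.thrift" % "libthrift" % "0.9.0"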

Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Akhil Das
You can set HADOOP_CONF_DIR inside the spark-env.sh file. Thanks Best Regards On Sat, Nov 1, 2014 at 4:14 AM, ameyc ambr...@gmail.com wrote: How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on YARN? My YARN environment has the correct HADOOP_CONF_DIR settings by the
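
For example (the path is hypothetical; point it at wherever the cluster's Hadoop/YARN client configs live):

    # conf/spark-env.sh on the machine that launches the job
    export HADOOP_CONF_DIR=/etc/hadoop/conf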

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Sean Owen
saveAsTextFile means save every element of the RDD as one line of text. It works like TextOutputFormat in Hadoop MapReduce, since that's what it uses. So you are causing it to create one big string out of each Iterable this way. On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar reachb...@gmail.com
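
A hedged sketch of the usual workaround (RDD names are hypothetical): flatten the groups back out so every value gets its own line instead of one line per Iterable; if nothing else needs the grouped form, dropping the groupByKey entirely also avoids buffering each group:

    pairs.groupByKey()
      .flatMapValues(identity)               // (K, Iterable[V]) -> one (K, V) per element
      .map { case (k, v) => s"$k\t$v" }      // one modest line of text per element
      .saveAsTextFile("hdfs:///out")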

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread ashu
Hi, Sorry to bump this old thread. What is the state now? Is this problem solved? How does Spark handle categorical data now? Regards, Ashutosh

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread Xiangrui Meng
This operation requires two transformers: 1) Indexer, which maps string features into categorical features, and 2) OneHotEncoder, which flattens categorical features into binary features. We are working on the new dataset implementation so that we can easily express those transformations. Sorry for the late reply!
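
To make the two steps concrete, a hand-rolled sketch of the idea in plain Scala (this is not the API being referred to, which was still under design at the time):

    // 1) Indexer: map each distinct string value to a categorical index.
    val colors = Seq("red", "green", "blue", "green")
    val index: Map[String, Int] = colors.distinct.zipWithIndex.toMap   // red -> 0, green -> 1, blue -> 2

    // 2) One-hot encoder: flatten a categorical index into a binary vector.
    def oneHot(value: String): Array[Double] = {
      val v = Array.fill(index.size)(0.0)
      v(index(value)) = 1.0
      v
    }
    val encoded = colors.map(oneHot)   // red -> [1,0,0], green -> [0,1,0], ...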

How do I kill a job submitted with spark-submit

2014-11-02 Thread Steve Lewis
I see the job in the web interface but don't know how to kill it

Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Amey Chaugule
I thought that only applied when you're trying to run a job using spark-submit or in the shell... On Sun, Nov 2, 2014 at 8:47 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can set HADOOP_CONF_DIR inside the spark-env.sh file Thanks Best Regards On Sat, Nov 1, 2014 at 4:14 AM, ameyc

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
Thanks Sean! My novice understanding is that the 'native heap' is the address space not allocated to the JVM heap, but I wanted to check to see if I'm missing something. I found out my issue appeared to be actual memory pressure on the executor machine. There was space for the JVM heap but not
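
A minimal sketch of the usual mitigation, which is simply to cap the executor JVM heap below the machine's physical memory so native allocations still have room (the sizes are hypothetical):

    import org.apache.spark.SparkConf

    // On a box with ~16 GB of RAM, keep several GB outside the JVM heap
    // for the OS, off-heap buffers, and other native allocations.
    val conf = new SparkConf()
      .set("spark.executor.memory", "12g")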

Spark SQL takes unexpected time

2014-11-02 Thread Shailesh Birari
Hello, I have written a Spark SQL application which reads data from HDFS and queries it. The data size is around 2GB (30 million records). The schema and query I am running are below. The query takes around 5+ seconds to execute. I tried by adding

Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
Did https://issues.apache.org/jira/browse/SPARK-3807 fix the issue seen by you? If yes, then please note that it shall be part of 1.1.1 and 1.2. Chirag From: Chen Song chen.song...@gmail.com Date: Wednesday, 15 October 2014 4:03 AM To:

Spark cluster stability

2014-11-02 Thread jatinpreet
Hi, I am running a small 6-node Spark cluster for testing purposes. Recently, the disk on one of the nodes was filled up by temporary files and there was no space left. Due to this my Spark jobs started failing, even though the node was still shown as 'Alive' on the Spark Web UI. Once I logged

Re: Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Sean Owen
Yes, that's correct to my understanding and the probable explanation of your issue. There are no additional limits or differences from how the JVM works here. On Nov 3, 2014 4:40 AM, Paul Wais pw...@yelp.com wrote: Thanks Sean! My novice understanding is that the 'native heap' is the address

Re: Spark cluster stability

2014-11-02 Thread Akhil Das
You can enable monitoring (e.g. Nagios) with alerts to tackle these kinds of issues. Thanks Best Regards On Mon, Nov 3, 2014 at 10:55 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am running a small 6-node Spark cluster for testing purposes. Recently, the disk on one of the nodes was

Parquet files are only 6-20MB in size?

2014-11-02 Thread ag007
Hi there, I have a pySpark job that simply takes a tab-separated CSV and outputs it to a Parquet file. The code is based on the SQL write-parquet example (using a different inferred schema, only 35 columns). The input files range from 100MB to 12GB. I have tried different block
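
One common knob, sketched here in Scala for brevity (the pySpark calls are analogous), is to coalesce to fewer partitions before writing, since each partition becomes its own Parquet file; the names and the partition count are placeholders:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD                  // implicit case-class RDD -> SchemaRDD (Spark 1.1)

    case class Record(a: String, b: String)            // stand-in for the real 35-column schema
    val rows = sc.textFile("hdfs:///in/*.tsv")
      .map(_.split("\t"))
      .map(f => Record(f(0), f(1)))
      .coalesce(8)                                     // fewer partitions -> fewer, larger Parquet files
    rows.saveAsParquetFile("hdfs:///out.parquet")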

graph x extracting the path

2014-11-02 Thread dizzy5112
Hi all, just wondering if there is a way to extract paths in GraphX. For example, given the attached graph I would like to return results along the lines of: 101 - 103, 101 - 104 - 108, 102 - 105, 102 - 106 - 107 http://apache-spark-user-list.1001560.n3.nabble.com/file/n17936/graph.jpg
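
A minimal sketch of one way to get that output, assuming the edge list is small enough to collect to the driver (the graph value is hypothetical); for large graphs a Pregel-style message-passing approach inside GraphX would be needed instead:

    // Build an adjacency map on the driver from the GraphX edge RDD.
    val adj: Map[Long, Seq[Long]] =
      graph.edges.map(e => (e.srcId, e.dstId)).collect()
        .groupBy(_._1).mapValues(_.map(_._2).toSeq)

    // Roots are vertices that never appear as a destination.
    val dsts = adj.values.flatten.toSet
    val roots = adj.keys.filterNot(dsts.contains)

    // Depth-first enumeration of root-to-leaf paths.
    def paths(v: Long, prefix: List[Long]): Seq[List[Long]] = adj.get(v) match {
      case Some(children) => children.flatMap(c => paths(c, prefix :+ v))
      case None           => Seq(prefix :+ v)
    }

    roots.flatMap(r => paths(r, Nil)).foreach(p => println(p.mkString(" - ")))
    // e.g. 101 - 103, 101 - 104 - 108, 102 - 105, 102 - 106 - 107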