Re: CANNOT FIND ADDRESS

2014-11-01 Thread Akhil Das
Try setting spark.storage.memoryFraction=0.9. On 31 Oct 2014 20:27, akhandeshi ami.khande...@gmail.com wrote: Thanks for the pointers! I did try, but it didn't seem to help... In my latest try, I am doing spark-submit local, but I see the same message in the Spark App UI (4040): localhost
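A minimal sketch of how that setting could be applied when constructing the SparkContext, assuming a Scala driver; the app name and master are placeholders, and the same property can also be passed to spark-submit via --conf:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the fraction of executor heap reserved for cached blocks
// (the default in Spark 1.x is 0.6); app name and master are placeholders.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[*]")
  .set("spark.storage.memoryFraction", "0.9")

val sc = new SparkContext(conf)
{code}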

--executor-cores cannot change vcores in yarn?

2014-11-01 Thread Gen
Hi, maybe it is a stupid question, but I am running Spark on YARN. I request the resources with the following command: {code} ./spark-submit --master yarn-client --num-executors #number of workers --executor-cores #number of cores ... {code} However, after launching the task, I use /yarn node

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-11-01 Thread Marius Soutier
Are these /vols formatted? You typically need to format and define a mount point in /mnt for attached EBS volumes. I’m not using the ec2 script, so I don’t know what is installed, but there’s usually an HDFS info service running on port 50070. After changing hdfs-site.xml, you have to restart

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Hi Jean, Thanks for reporting this. This is indeed a bug: for some column types (Binary, Array, Map, and Struct, and unfortunately, for some reason, Boolean), a NoopColumnStats is used to collect column statistics, which causes this issue. Filed SPARK-4182 to track it; will fix this ASAP.

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Jean-Pascal Billaud
Great! Thanks. Sent from my iPad On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote: Hi Jean, Thanks for reporting this. This is indeed a bug: for some column types (Binary, Array, Map, and Struct, and unfortunately, for some reason, Boolean), a NoopColumnStats is used to

Re: A Spark Design Problem

2014-11-01 Thread Steve Lewis
A join seems to me the proper approach, followed by keying the fits by KeyID and using combineByKey to choose the best one - I am implementing that now and will report on performance. On Fri, Oct 31, 2014 at 11:56 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Does the following help?
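A rough Scala sketch of the combineByKey step described above; the Fit type, its score field, and the RDD name are hypothetical stand-ins for the poster's data:

{code}
import org.apache.spark.rdd.RDD

// Hypothetical fit record: a score to maximize when choosing the best fit.
case class Fit(keyId: Long, score: Double)

// fits: RDD[(Long, Fit)] keyed by KeyID, e.g. produced by the join step.
def bestFitPerKey(fits: RDD[(Long, Fit)]): RDD[(Long, Fit)] =
  fits.combineByKey(
    (f: Fit) => f,                                                 // first fit seen for a key
    (best: Fit, f: Fit) => if (f.score > best.score) f else best,  // merge a new fit into the best so far
    (a: Fit, b: Fit) => if (a.score > b.score) a else b            // merge bests from different partitions
  )
{code}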

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-11-01 Thread TJ Klein
Hi, I get exactly the same error. It runs on my local machine but not on the cluster. I am running the pi.py example. Best, Tassilo

Re: Spark speed performance

2014-11-01 Thread jan.zikes
Now I am running into problems using: distData = sc.textFile(sys.argv[2]).coalesce(10). The problem is that it seems that Spark is trying to put all the data into RAM first and then perform the coalesce. Do you know if there is something that would do the coalesce on the fly with, for example, a fixed size of

OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~100 million. The data size is 20GB and groupBy results in an RDD of 1061 keys with values of type Iterable<Tuple4<String, Integer, Double, String>>. The job runs on 3 hosts in a standalone setup with each host's
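For reference, a minimal Scala sketch of the pattern being described; the RDD name, the grouping function, and the output path are placeholders:

{code}
// records: RDD[(String, Integer, Double, String)] of roughly 100 million rows.
// keyOf is a placeholder for the grouping function mentioned above.
val grouped = records.groupBy(keyOf)            // ~1061 keys, each with a very large Iterable
grouped.saveAsTextFile("hdfs:///path/to/out")   // output path is a placeholder
{code}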

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Minor clarification: I'm running Spark 1.1.0 on JDK 1.8, Linux 64-bit. On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~100 million. The data size is 20GB and groupBy results

Re: Spark speed performance

2014-11-01 Thread Aaron Davidson
coalesce() is a streaming operation if used without the second parameter; it does not put all the data in RAM. If used with the second parameter (shuffle = true), then it performs a shuffle, but it still does not put all the data in RAM. On Sat, Nov 1, 2014 at 12:09 PM, jan.zi...@centrum.cz wrote:
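A short Scala sketch of the two forms, with a placeholder input path:

{code}
val data = sc.textFile("hdfs:///path/to/input")     // path is a placeholder

// Without the second argument: a narrow, streaming repartitioning; partitions
// are merged as records flow through, nothing is materialized entirely in RAM.
val merged = data.coalesce(10)

// With shuffle = true: a full shuffle redistributes records across 10
// partitions, still without holding the whole dataset in RAM.
val reshuffled = data.coalesce(10, shuffle = true)
{code}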

union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
I would like to combine 2 parquet tables I have created. I tried: sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB)) but that just returns RDD[Row]. How do I combine them to get a SchemaRDD[Row]? thanks Daniel

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results. Matei On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote: I would like to combine 2 parquet tables I have create. I tried: sc.union(sqx.parquetFile(fileA),
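A minimal sketch of that suggestion, reusing the sqx SQLContext and the placeholder file names from the question:

{code}
// sqx is the SQLContext from the question; fileA and fileB are placeholder Parquet paths.
val a = sqx.parquetFile("fileA")
val b = sqx.parquetFile("fileB")

// unionAll keeps the schema, returning a SchemaRDD rather than a plain RDD[Row].
val combined = a.unionAll(b)
{code}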

Re: union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
Thanks Matei. What does unionAll do if the input RDD schemas are not 100% compatible? Does it take the union of the columns and generalize the types? thanks Daniel On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try unionAll, which is a special method on SchemaRDDs

org.apache.hadoop.security.UserGroupInformation.doAs Issue

2014-11-01 Thread TJ Klein
Hi there, I am trying to run the example code pi.py on a cluster; however, I only got it working on localhost. When trying to run in standalone mode, ./bin/spark-submit \ --master spark://[mymaster]:7077 \ examples/src/main/python/pi.py \ I get warnings about resources and memory (the

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
It does generalize types, but only on the intersection of the columns it seems. There might be a way to get the union of the columns too using HiveQL. Types generalize up with string being the most general. Matei On Nov 1, 2014, at 6:22 PM, Daniel Mahler dmah...@gmail.com wrote: Thanks

Re: Spark SQL : how to find element where a field is in a given set

2014-11-01 Thread abhinav chowdary
I have the same requirement of passing a list of values to an IN clause. When I try to do it, I get the error below: scala> val longList = Seq[Expression]("a", "b") console:11: error: type mismatch; found: String("a") required: org.apache.spark.sql.catalyst.expressions.Expression val longList =
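One way the type mismatch might be resolved, sketched under the assumption that Catalyst's Literal and In expressions and the Symbol-based DSL (brought in by import sqlContext._) are available in this Spark version; the SchemaRDD and column names are hypothetical:

{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}
// assumes import sqlContext._ is in scope for the Symbol-to-attribute implicit

// Wrap the raw strings in Literal so they become Catalyst expressions.
val longList: Seq[Expression] = Seq(Literal("a"), Literal("b"))

// events is a hypothetical SchemaRDD; keep rows whose 'category value is in the list.
val matched = events.where(In('category, longList))
{code}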

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059 On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Great! Thanks. Sent from my iPad On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote: Hi Jean, Thanks for reporting

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Resurfacing the thread. OOM shouldn't be the norm for a common groupBy/sort use case in a framework that is leading in sorting benchmarks, should it? Or is there something fundamentally wrong in the usage? On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, I'm trying to run

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com
Hi, FYI as follows. Could you post your heap size settings as well as your Spark app code? Regards, Arthur. 3.1.3 Detail Message: Requested array size exceeds VM limit. The detail message "Requested array size exceeds VM limit" indicates that the application (or APIs used by that application)

How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Hello, I am trying to run the Connected Components algorithm on a very big graph. In practice I found that a small number of partitions would lead to OOM, while a large number would cause various timeout exceptions. Thus I wonder how to estimate the number of partitions for a graph in GraphX?
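For context, a small Scala sketch of the knobs usually involved when loading and partitioning the graph; the path and the partition count are placeholders, not a tuning recommendation:

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// minEdgePartitions controls how many edge partitions the edge list is loaded into;
// the path and the count here are placeholders.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///path/to/edges", minEdgePartitions = 64)
  .partitionBy(PartitionStrategy.EdgePartition2D)

val cc = graph.connectedComponents()
{code}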