Try this:
spark.storage.memoryFraction 0.9
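For illustration, a minimal sketch of setting that property programmatically, assuming Spark 1.x (where spark.storage.memoryFraction is still the relevant knob); the app name is a placeholder:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the fraction of executor heap reserved for cached blocks (default 0.6).
// 0.9 leaves little headroom for shuffle and task memory, so treat it as a tuning knob.
val conf = new SparkConf()
  .setAppName("memory-fraction-example")        // placeholder app name
  .set("spark.storage.memoryFraction", "0.9")
val sc = new SparkContext(conf)
{code}
The same property can also be passed to spark-submit with --conf spark.storage.memoryFraction=0.9.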
On 31 Oct 2014 20:27, akhandeshi ami.khande...@gmail.com wrote:
Thanks for the pointers! I did try, but it didn't seem to help...
In my latest try, I am running spark-submit with a local master, but I see the
same message in the Spark app UI (port 4040) on localhost.
Hi,
Maybe it is a stupid question, but I am running Spark on YARN. I request the
resources by the following command:
{code}
./spark-submit --master yarn-client --num-executors #number of workers
--executor-cores #number of cores ...
{code}
However, after launching the task, I use /yarn node
Are these /vols formatted? You typically need to format and define a mount
point in /mnt for attached EBS volumes.
I'm not using the ec2 script, so I don't know what is installed, but there's
usually an HDFS info service running on port 50070. After changing
hdfs-site.xml, you have to restart the HDFS daemons for the change to take effect.
Hi Jean,
Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to collect column statistics, which causes this
issue. Filed SPARK-4182 to track it; will fix this ASAP.
Great! Thanks.
Sent from my iPad
On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote:
Hi Jean,
Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to
A join seems to me the proper approach, followed by keying the fits by KeyID
and using combineByKey to choose the best.
I am implementing that now and will report on performance.
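For what it's worth, a minimal sketch of that pattern, assuming each fit carries a numeric score and "best" means highest score; the Fit class, field names, and sample data are hypothetical:
{code}
// Assuming a SparkContext `sc` (e.g. in spark-shell).
case class Fit(model: String, score: Double)   // hypothetical fit record

// Stand-in for the join output, keyed by KeyID.
val fitsByKey = sc.parallelize(Seq(
  ("k1", Fit("m1", 0.8)), ("k1", Fit("m2", 0.9)), ("k2", Fit("m3", 0.5))))

// combineByKey keeps only the best-scoring fit per key, without collecting
// all fits for a key into memory the way groupByKey would.
val bestFitPerKey = fitsByKey.combineByKey[Fit](
  (fit: Fit) => fit,                                                  // createCombiner
  (best: Fit, fit: Fit) => if (fit.score > best.score) fit else best, // mergeValue
  (a: Fit, b: Fit) => if (a.score > b.score) a else b)                // mergeCombiners
{code}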
On Fri, Oct 31, 2014 at 11:56 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
Does the following help?
Hi,
I get exactly the same error. It runs on my local machine but not on the
cluster. I am running the pi.py example.
Best,
Tassilo
Now I am running into problems using:
distData = sc.textFile(sys.argv[2]).coalesce(10)
The problem is that it seems that Spark is trying to put all the data into RAM
first and then perform the coalesce. Do you know if there is something that would
do the coalesce on the fly with, for example, a fixed size of
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of
count ~ 100 million. The data size is 20GB and groupBy results in an RDD of
1061 keys with values being Iterable[Tuple4[String, Integer, Double,
String]]. The job runs on 3 hosts in a standalone setup with each host's
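For context, a rough sketch of the shape of such a job as described; the input path, parsing, and key extraction are placeholders, not the original code:
{code}
// Assuming a SparkContext `sc`; parsing and key extraction are placeholders.
val records = sc.textFile("hdfs:///path/to/input").map { line =>
  val f = line.split(",")
  (f(0), f(1).toInt, f(2).toDouble, f(3))   // Tuple4[String, Int, Double, String]
}

// groupBy(f) yields RDD[(K, Iterable[Tuple4[...]])]; all values for a given key
// must fit in memory as one collection, which is the usual source of memory
// pressure with very large or skewed groups.
val grouped = records.groupBy(_._1)         // ~1061 distinct keys in the reported job
grouped.saveAsTextFile("hdfs:///path/to/output")
{code}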
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit.
On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD
of count ~ 100 million. The data size is 20GB and groupBy results
coalesce() is a streaming operation if used without the second parameter;
it does not put all the data in RAM. If used with the second parameter
(shuffle = true), it performs a shuffle, but still does not put all
the data in RAM.
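A small illustration of both modes, assuming a SparkContext `sc` and a placeholder input path:
{code}
// Without the second parameter: partitions are merged as data streams through,
// so the whole dataset never has to sit in RAM at once.
val narrowed = sc.textFile("hdfs:///path/to/data").coalesce(10)

// With shuffle = true: a full shuffle is performed (useful to rebalance or to
// increase the partition count), but the data is still processed in a streaming way.
val reshuffled = sc.textFile("hdfs:///path/to/data").coalesce(10, shuffle = true)
{code}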
On Sat, Nov 1, 2014 at 12:09 PM, jan.zi...@centrum.cz wrote:
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
but that just returns RDD[Row].
How do I combine them to get a SchemaRDD[Row]?
thanks
Daniel
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on
the results.
Matei
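A short sketch of that, assuming a Spark 1.1 SQLContext (named sqx as in the original message) and placeholder Parquet paths:
{code}
import org.apache.spark.sql.SQLContext

val sqx = new SQLContext(sc)                     // assuming an existing SparkContext `sc`
val a = sqx.parquetFile("hdfs:///tables/fileA")  // SchemaRDD
val b = sqx.parquetFile("hdfs:///tables/fileB")  // SchemaRDD

// unionAll is defined on SchemaRDD and keeps the schema, unlike sc.union,
// which only gives back an RDD[Row].
val combined = a.unionAll(b)                     // still a SchemaRDD
{code}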
On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote:
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile(fileA),
Thanks Matei. What does unionAll do if the input RDD schemas are not 100%
compatible? Does it take the union of the columns and generalize the types?
thanks
Daniel
On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try unionAll, which is a special method on SchemaRDDs
Hi there,
I am trying to run the example code pi.py on a cluster; however, I only got
it working on localhost. When trying to run in standalone mode:
./bin/spark-submit \
--master spark://[mymaster]:7077 \
examples/src/main/python/pi.py \
I get warnings about resources and memory (the
It does generalize types, but only on the intersection of the columns it seems.
There might be a way to get the union of the columns too using HiveQL. Types
generalize upward, with string being the most general.
Matei
On Nov 1, 2014, at 6:22 PM, Daniel Mahler dmah...@gmail.com wrote:
Thanks
I have the same requirement of passing a list of values to an IN clause. When I
try to do this, I get the error below:

scala> val longList = Seq[Expression]("a", "b")
<console>:11: error: type mismatch;
 found   : String("a")
 required: org.apache.spark.sql.catalyst.expressions.Expression
       val longList =
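For reference, the mismatch is because plain Strings are not catalyst Expressions; a sketch that wraps them as Literals (against the Spark 1.x catalyst API, with the column expression left out as a placeholder):
{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Wrap the raw strings as Literal expressions so they satisfy Seq[Expression].
val longList = Seq[Expression](Literal("a"), Literal("b"))

// They can then be used wherever a list of Expressions is expected,
// e.g. as the right-hand side of an In(...) predicate (column expression not shown here).
{code}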
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059
On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Great! Thanks.
Sent from my iPad
On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote:
Hi Jean,
Thanks for reporting
Resurfacing the thread. OOM shouldn't be the norm for a common groupBy /
sort use case in a framework that is leading in sorting benchmarks, should it? Or is
there something fundamentally wrong in the usage?
On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hi,
I'm trying to run
Hi,
FYI as follows. Could you post your heap size settings as well as your Spark app
code?
Regards
Arthur
3.1.3 Detail Message: Requested array size exceeds VM limit
The detail message Requested array size exceeds VM limit indicates that the
application (or APIs used by that application)
Hello,
I am trying to run the Connected Components algorithm on a very big graph. In
practice I found that a small number of partitions leads to OOM,
while a large number causes various timeout exceptions. Thus I wonder
how to estimate the number of partitions for a graph in GraphX?
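The thread doesn't settle on a formula, but for illustration, a sketch of loading the edge list with an explicit number of edge partitions and running connected components (path and partition count are placeholders; GraphX API as of Spark 1.x):
{code}
import org.apache.spark.graphx.GraphLoader

// The 4th argument of edgeListFile controls how many partitions the edges are
// loaded into; too few risks OOM, too many adds scheduling and shuffle overhead.
val edgePartitions = 400                       // placeholder; tune per cluster and graph size
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges", false, edgePartitions)

// Connected components: each vertex ends up labeled with the smallest vertex id
// in its component.
val cc = graph.connectedComponents().vertices
{code}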