Adding on to what Thomas said. There have been a few bug fixes for s3a since
Hadoop 2.6.0 was released. One example is HADOOP-11446.
The fixes would be in Hadoop 2.7.0
Cheers
On Jan 27, 2015, at 1:41 AM, Thomas Demoor thomas.dem...@amplidata.com
wrote:
Spark uses the Hadoop filesystems.
Created SPARK-5427 for the addition of the floor function.
On Mon, Jan 26, 2015 at 11:20 PM, Alexey Romanchuk
alexey.romanc...@gmail.com wrote:
I have tried select ceil(2/3), but got key not found: floor
On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you tried floor()?
Hi all,
I can run a Spark job programmatically in J2SE with the following code without any error:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class TestSpark {
public static void main(String[] args) {
String
will be fixed by https://github.com/apache/spark/pull/4226
On Tue, Jan 27, 2015 at 8:17 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
In Spark 1.2.0, it requires that the ratings be an RDD of Rating, tuple or
list. However, the current example on the site still uses RDD[array]
as the
Maybe it's caused by integer overflow; is it possible that one object
or batch is bigger than 2G (after pickling)?
On Tue, Jan 27, 2015 at 7:59 AM, rok rokros...@gmail.com wrote:
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
without problems. When I try to read it back
Hi Gerard,
As others have mentioned, I believe you're hitting MESOS-1688. Can you
upgrade to the latest Mesos release (0.21.1) and let us know if it resolves
your problem?
Thanks,
Tim
On Tue, Jan 27, 2015 at 10:39 AM, Sam Bessalah samkiller@gmail.com
wrote:
Hi Gerard,
isn't this the same
Hi Gerard,
Isn't this the same issue as this?
https://issues.apache.org/jira/browse/MESOS-1688
On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We are observing with certain regularity that our Spark jobs, as a Mesos
framework, are hoarding resources and not
Hi Team,
I need some help on writing a Scala job to bulk load some data into HBase.
*Env:*
hbase 0.94
spark-1.0.2
I am trying the below code to bulk load some data into HBase table “t1”.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import
Here is the method signature used by HFileOutputFormat:
public void write(ImmutableBytesWritable row, KeyValue kv)
Meaning, KeyValue is expected, not Put.
On Tue, Jan 27, 2015 at 10:54 AM, Jim Green openkbi...@gmail.com wrote:
Hi Team,
I need some help on writing a Scala job to bulk load
To clarify: I'm currently working on this locally, running on a laptop and I
do not use Spark-submit (using Eclipse to run my applications currently).
I've tried running both on Mac OS X and in a VM running Ubuntu. Furthermore,
I've got the VM from a fellow worker which has no issues running his
Hopefully it's some very bad, ugly bug that has been fixed already and that
will urge us to upgrade our infra?
Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
Could be https://issues.apache.org/jira/browse/MESOS-1688 (fixed in Mesos
0.21)
On Mon, Jan 26, 2015 at 2:45 PM, Gerard Maas gerard.m...@gmail.com
The root cause for this is probably that identical “exprId”s of the
“AttributeReference” exist while doing a self-join with a “temp table” (temp table =
resolved logical plan).
I will do the bug fix and JIRA creation.
Cheng Hao
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent:
Thanks Akhil!
1) I had to do sudo -u hdfs hdfs dfsadmin -safemode leave
a) I had created a user called hdfs with superuser privileges in Hue, hence
the double hdfs.
2) Lastly, I know this is getting a bit off topic, but this is my /etc/hosts
file:
127.0.0.1 localhost.localdomain
Spark uses the Hadoop filesystems.
I assume you are trying to use s3n:// which, under the hood, uses the 3rd
party jets3t library. It is configured through the jets3t.properties file
(google hadoop s3n jets3t) which you should put on Spark's classpath. The
setting you are looking for is
Hi All,
I have a use case where I have an RDD (not a k,v pair) where I want to do a
combineByKey() operation. I can do that by creating an intermediate RDD of k,v
pairs and using PairRDDFunctions.combineByKey(). However, I believe it will be
more efficient if I can avoid this intermediate RDD.
Thanks Ted. Could you give me a simple example of loading one row of data into
HBase? How should I generate the KeyValue?
I tried multiple times and still cannot figure it out.
On Tue, Jan 27, 2015 at 12:10 PM, Ted Yu yuzhih...@gmail.com wrote:
Here is the method signature used by HFileOutputFormat:
Have you looked at the `aggregate` function in the RDD API ?
If your way of extracting the “key” (identifier) and “value” (payload) parts of
the RDD elements is uniform (a function), it’s unclear to me how this would be
more efficient than extracting key and value and then using combine,
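To make the aggregate suggestion concrete, here is a small sketch (the element type, the key/value extraction and the merge logic are assumptions for illustration, not code from this thread). Note that aggregate returns the whole map to the driver, so this only makes sense when the number of distinct keys is small.

import org.apache.spark.rdd.RDD

case class Event(user: String, bytes: Long)  // assumed element type

// Build per-key totals directly, without creating an intermediate (k, v) pair RDD
def sumByUser(rdd: RDD[Event]): Map[String, Long] =
  rdd.aggregate(Map.empty[String, Long])(
    // seqOp: fold one element into the per-partition map
    (acc, e) => acc.updated(e.user, acc.getOrElse(e.user, 0L) + e.bytes),
    // combOp: merge the per-partition maps
    (m1, m2) => m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    })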
Ok, thanks for your reply!
On Tue, Jan 27, 2015 at 2:32 PM, Joseph Bradley jos...@databricks.com
wrote:
Hi Andres,
Currently, serializing the object is probably the best way to do it.
However, there are efforts to support actual model import/export:
Hi Antony,
If you look in the YARN NodeManager logs, do you see that it's killing the
executors? Or are they crashing for a different reason?
-Sandy
On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
I am using spark.yarn.executor.memoryOverhead=8192 yet
Use combineByKey. For top 10 as an example (bottom 10 works similarly): add
the element to a list. If the list is larger than 10, delete the smallest
elements until the size is back to 10.
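A rough sketch of that bounded-list approach (assuming an existing SparkContext sc and illustrative input data; a real implementation would probably keep a bounded priority queue rather than re-sorting on every insert):

val topN = 10
val data: Array[(Int, (String, Int))] =
  Array((1, ("word1", 5)), (1, ("word2", 9)), (2, ("word3", 1)))  // illustrative

val topPerIndex = sc.parallelize(data)
  .combineByKey(
    (v: (String, Int)) => List(v),                     // createCombiner: start a list
    (acc: List[(String, Int)], v: (String, Int)) =>
      (v :: acc).sortBy(p => -p._2).take(topN),        // mergeValue: keep the largest N
    (a: List[(String, Int)], b: List[(String, Int)]) =>
      (a ++ b).sortBy(p => -p._2).take(topN))          // mergeCombiners
  .collect()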
-Sven
On Tue, Jan 27, 2015 at 3:35 AM, kundan kumar iitr.kun...@gmail.com wrote:
I have an array of the form
I used the below code, and it still failed with the same error.
Does anyone have experience with bulk loading using Scala?
Thanks.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import
I downloaded Spark 1.2.0 on my Windows Server 2008. How do I start the Spark master?
I tried to run the following on command prompt
C:\spark-1.2.0-bin-hadoop2.4> bin\spark-class.cmd
org.apache.spark.deploy.master.Master
I got the error
else was unexpected at this time.
Ningjun
Hi Andres,
Currently, serializing the object is probably the best way to do it.
However, there are efforts to support actual model import/export:
https://issues.apache.org/jira/browse/SPARK-4587
https://issues.apache.org/jira/browse/SPARK-1406
I'm hoping to have the PR for the first JIRA ready
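Until then, a minimal sketch of the plain Java serialization workaround mentioned above (the path and type parameter are illustrative; the model object just needs to be Serializable):

import java.io._

def saveModel[M <: Serializable](model: M, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel[M](path: String): M = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[M] finally in.close()
}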
Hi,
I am using spark.yarn.executor.memoryOverhead=8192, yet I am getting executors
crashed with this error.
Does that mean I genuinely don't have enough RAM, or is this a matter of config
tuning?
Other config options used: spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G
running Spark 1.2.0 as
I configured 4 PCs with spark-1.2.0-bin-hadoop2.4.tgz with Spark Standalone
on a cluster.
On the master PC I executed:
./sbin/start-master.sh
./sbin/start-slaves.sh (4 instances)
On the datanode1, datanode2 and secondarymaster PCs I executed:
./bin/spark-class org.apache.spark.deploy.worker.Worker
We are running Spark in Google Compute Engine using their One-Click Deploy.
By doing so, we get their Google Cloud Storage connector for Hadoop for free,
meaning we can specify gs:// paths for input and output.
We have jobs that take a couple of hours and end up with ~9k partitions, which
means 9k
Hi,
in my Spark Streaming application, computations depend on users' input in
terms of
* user-defined functions
* computation rules
* etc.
that can throw exceptions in various cases (think: exception in UDF,
division by zero, invalid access by key etc.).
Now I am wondering about what is a
After slimming down the job quite a bit, it looks like a call to coalesce()
on a larger RDD can cause these Python worker spikes (additional details in
Jira:
Looks like the culprit is this error:
FileNotFoundException: File
file:/home/sparkuser/spark-1.2.0/spark-1.2.0-bin-hadoop2.4/data/cut/ratings.txt
does not exist
Mohammed
-Original Message-
From: riginos [mailto:samarasrigi...@gmail.com]
Sent: Tuesday, January 27, 2015 4:24 PM
To:
Hi, Jim
Your generated RDD should be of type RDD[(ImmutableBytesWritable, KeyValue)],
while your current type is RDD[(ImmutableBytesWritable, Put)].
You can go like this, and the result should be of type
RDD[(ImmutableBytesWritable, KeyValue)], which can be saved with saveAsNewAPIHadoopFile:
val result =
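For illustration, a fuller sketch of that shape (the input data, column family, qualifier and output path are assumptions; sc is an existing SparkContext; this only writes the HFiles, the LoadIncrementalHFiles step that moves them into the table is not shown):

import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val rows = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

// HFileOutputFormat expects row keys in sorted order, so sort before converting
val result = rows.sortByKey().map { case (rowKey, value) =>
  val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
    Bytes.toBytes("c1"), Bytes.toBytes(value))
  (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
}

result.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], conf)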
Thanks, Ted. Kerberos is enabled on the cluster.
I'm new to the world of Kerberos, so please excuse my ignorance here. Do you
know if there are any additional steps I need to take in addition to
setting HADOOP_CONF_DIR? For instance, does hadoop.security.auth_to_local
require any specific setting
Thanks, Siddardha. I did but got the same error. Kerberos is enabled on my
cluster and I may be missing a configuration step somewhere.
Absolute path means no ~ and also verify that you have the path to the file
correct. For some reason the Python code does not validate that the file
exists and will hang (this is the same reason why ~ hangs).
On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick pzybr...@gmail.com wrote:
Try using an
Hi Anthony,
What is the setting of the total amount of memory in MB that can be
allocated to containers on your NodeManagers?
yarn.nodemanager.resource.memory-mb
Can you check the above configuration in the yarn-site.xml used by the node
manager process?
-Guru Medasani
From: Sandy Ryza
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
The use case is a Spark YARN app that will start and serve as a query server for
multiple users, i.e. always up and running. At startup, there is an option to
cache data and also pre-compute some result sets, hash maps etc. that
would be
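As a sketch of that direction with a plain SQLContext (so no metastore_db gets created; the Parquet path, table name and query are illustrative and sc is an existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val table1 = sqlContext.parquetFile("/data/table1")  // assumed Parquet input
table1.registerTempTable("table1")
sqlContext.cacheTable("table1")                      // cache once at startup

// served repeatedly while the app stays up
val result = sqlContext.sql("SELECT col1, COUNT(*) FROM table1 GROUP BY col1")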
I've read that this is supposed to be a rather significant optimization to
the shuffle system in 1.1.0, but I'm not seeing much documentation on
enabling this in YARN. I see GitHub classes for it in 1.2.0 and a property
spark.shuffle.service.enabled in spark-defaults.conf.
The code mentions
Since it's an executor running OOM it doesn't look like a container being
killed by YARN to me. As a starting point, can you repartition your job
into smaller tasks?
-Sven
On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani gdm...@outlook.com wrote:
Hi Anthony,
What is the setting of the total
Can you attach the logs where this is failing?
From: Sven Krasser kras...@gmail.com
Date: Tuesday, January 27, 2015 at 4:50 PM
To: Guru Medasani gdm...@outlook.com
Cc: Sandy Ryza sandy.r...@cloudera.com, Antony Mayi
antonym...@yahoo.com, user@spark.apache.org user@spark.apache.org
Subject:
I have YARN configured with yarn.nodemanager.vmem-check-enabled=false and
yarn.nodemanager.pmem-check-enabled=false to avoid YARN killing the containers.
The stack trace is below.
Thanks, Antony.
15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL
15: SIGTERM
15/01/27
I have an array of the form
val array: Array[(Int, (String, Int))] = Array(
(idx1,(word1,count1)),
(idx2,(word2,count2)),
(idx1,(word1,count1)),
(idx3,(word3,count1)),
(idx4,(word4,count4)))
I want to get the top 10 and bottom 10 elements from this array for each
index
Hello All, I am writing a simple Spark application to count UV (unique
views) from a log file. Below is my code; it is not right on the red line. My idea
here is that the same cookie on a host should only be counted once, so I want to
split the host from the previous RDD. But now I don't know how to finish it
Hi all,
we are building a custom JDBC receiver that would create a stream from SQL
tables. We're not sure what the best way is to shut down the stream once all the
data goes through, as the receiver knows it is complete but it cannot
initiate the stream shutdown.
Any suggestion on how to structure
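In case it helps while you wait for answers, one possible shape, sketched purely from the description above (the class name, constructor parameter and stop message are assumptions, not a tested pattern): the receiver stops itself once everything has been emitted, and the driver still has to stop the StreamingContext separately (e.g. by polling a flag or using awaitTermination with a timeout).

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JdbcReceiver(query: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

  override def onStart(): Unit = {
    new Thread("jdbc-receiver") {
      override def run(): Unit = {
        // open the connection, iterate over the result set, store(row) each row...
        stop("All rows have been read")  // receiver-side completion
      }
    }.start()
  }

  override def onStop(): Unit = { /* close the connection */ }
}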
Okay, I finally tried changing the hadoop-client version from 2.4.0 to 2.5.2,
and that mysteriously fixed everything.
Aha, you’re right, I made a wrong comparison; the reason might only be
checkpointing :).
Thanks
Jerry
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, January 28, 2015 10:39 AM
To: Shao, Saisai
Cc: user
Subject: Re: Why must the dstream.foreachRDD(...) parameter be
I downloaded and installed the spark-1.2.0-bin-hadoop2.4.tgz pre-built version on a
Windows 2008 R2 server. When I submit a job using spark-submit, I get the
following error:
WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop
library for your platform
... using builtin-java
It is a Streaming application, so how/when do you plan to access the
accumulator on the driver?
On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
thanks for your mail!
On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
That seems
I wanted to update this thread for others who may be looking for a solution
to this as well. I found [1] and I'm going to investigate whether this is a
viable solution.
[1]
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
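For reference, the pattern that answer describes, sketched with the old Hadoop API (the class name and output path are illustrative, and keyedRdd is assumed to be an RDD[(String, String)] keyed by the group identifier):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // route each record into a sub-directory named after its key
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  // drop the key from the written lines
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

keyedRdd.saveAsHadoopFile("/out/base/dir",
  classOf[String], classOf[String], classOf[KeyBasedOutput])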
On Wed, Jan 28, 2015 at 12:51
On 1/27/15 5:55 PM, Cheng Lian wrote:
On 1/27/15 11:38 AM, Manoj Samel wrote:
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
The use case is a Spark YARN app that will start and serve as a query server for
multiple users, i.e. always up and running. At startup, there is an
option
I have the same problem too.
org.apache.hadoop:hadoop-common:jar:2.4.0 brings in
commons-configuration:commons-configuration:jar:1.6, but we're using
commons-configuration 1.8.
Is there any workaround for this?
Greg
Hi,
On Wed, Jan 28, 2015 at 1:45 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
It is a Streaming application, so how/when do you plan to access the
accumulator on the driver?
Well... maybe there would be some user command or web interface showing the
errors that have happened during
You can add exclusion to Spark pom.xml
Here is an example (for hadoop-client which brings in hadoop-common)
diff --git a/pom.xml b/pom.xml
index 05cb379..53947d9 100644
--- a/pom.xml
+++ b/pom.xml
@@ -632,6 +632,10 @@
       <scope>${hadoop.deps.scope}</scope>
       <exclusions>
I need to be able to take an input RDD[Map[String,Any]] and split it into
several different RDDs based on some partitionable piece of the key
(groups) and then send each partition to a separate set of files in
different folders in HDFS.
1) Would running the RDD through a custom partitioner be the
Hi everyone,
Is there a way to save on disk the model to reuse it later? I could
serialize the object and save the bytes, but I'm guessing there might be a
better way to do so.
Has anyone tried that?
Andres.
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
without problems. When I try to read it back in, it fails with:
Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most
recent failure: Lost task 401.3 in stage 0.0 (TID 449,
e1326.hpc-lca.ethz.ch):
Hi,
Did you try asking this on StackOverflow?
http://stackoverflow.com/questions/tagged/apache-spark
I'd also suggest adding some sample data to help others understand your
logic.
-kr, Gerard.
On Tue, Jan 27, 2015 at 1:14 PM, 老赵 laozh...@sina.cn wrote:
Hello All,
I am writing a simple
Caused by: java.lang.IllegalArgumentException: Invalid rule: L
RULE:[2:$1@$0](.*@XXXCOMPANY.COM)s/(.*)@XXXCOMPANY.COM/$1/L
DEFAULT
Can you put the rule on a single line (not sure whether there is a newline or
space between L and DEFAULT)?
Looks
There is also SPARK-3533 (https://issues.apache.org/jira/browse/SPARK-3533),
which proposes to add a convenience method for this.
On Mon Jan 26 2015 at 10:38:56 PM Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
This might be helpful:
For those who found that absolute vs. relative path for the pem file
mattered, what OS and shell are you using? What version of Spark are you
using?
~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to the
absolute path before sending it to spark-ec2. (i.e. tilde expansion.)
Hi,
In Spark 1.2.0, it requires that the ratings be an RDD of Rating, tuple or
list. However, the current example on the site still uses
RDD[array] as the ratings. Therefore, the example doesn't work under
version 1.2.0.
Maybe we should update the documentation on the site?
Thanks a
Hey Alexey,
You need to use HiveContext in order to access Hive UDFs. You may try
it with bin/spark-sql (src is a Hive table):
spark-sql> select key / 3 from src limit 10;
79.33
28.668
103.67
9.0
55.0
136.34
85.0
92.67
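If you need it from a program rather than the spark-sql shell, a rough equivalent with HiveContext (assuming an existing SparkContext sc and a Spark build with Hive support; the table and expression follow the example above):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Hive UDFs such as floor() resolve through HiveContext
hiveContext.sql("SELECT floor(key / 3) FROM src LIMIT 10").collect().foreach(println)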
I have about 15-20 joins to perform. Each of these tables is in the order
of 6 million to 66 million rows. The number of columns ranges from 20 to
400.
I read the parquet files and obtain schemaRDDs.
Then use join functionality on 2 SchemaRDDs.
I join the previous join results with the next
This renaming from _temporary to the final location is actually done by
executors, in parallel, for saveAsTextFile. It should be performed by each
task individually before it returns.
I have seen an issue similar to what you mention dealing with Hive code
which did the renaming serially on the
I believe this is needed for driver recovery in Spark Streaming. If your Spark
driver program crashes, Spark Streaming can recover the application by reading
the set of DStreams and output operations from a checkpoint file (see
Hey Tobias,
I think one consideration is the checkpointing of DStreams, which guarantees driver
fault tolerance.
Also, this `foreachFunc` is more like an action function of an RDD, think of
rdd.foreach(func), in which `func` needs to be serializable. So maybe I think
your way of using it is not a
Hi,
thanks for your mail!
On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
That seems reasonable to me. Are you having any problems doing it this way?
Well, actually I haven't done that yet. The idea of using accumulators to
collect errors just came while
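A rough sketch of that accumulator idea (the UDF, the input DStream and the choice to simply drop failed records are assumptions; ssc is the StreamingContext). Keep in mind that accumulator updates made inside transformations can be applied more than once if tasks are retried.

import scala.util.{Failure, Success, Try}

val errorCount = ssc.sparkContext.accumulator(0L)

// Wrap the user-defined function in Try, count failures, keep only the successes
val safeResults = inputStream.flatMap { record =>
  Try(userDefinedFunction(record)) match {
    case Success(out) => Some(out)
    case Failure(_)   => errorCount += 1L; None
  }
}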
I'm not sure how to confirm how the moving is happening; however, one of
the jobs I was talking about, with 9k files of 4 MB each, just completed.
The Spark UI showed the job being complete after ~2 hours. The last four hours
of the job were just moving the files from _temporary to their final
Hi,
I want to do something like
dstream.foreachRDD(rdd => if (someCondition) ssc.stop())
so in particular the function does not touch any element in the RDD and
runs completely within the driver. However, this fails with a
NotSerializableException because $outer is not serializable etc. The
Hi,
thanks for the answers!
On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai saisai.s...@intel.com
wrote:
Also, this `foreachFunc` is more like an action function of an RDD, think
of rdd.foreach(func), in which `func` needs to be serializable. So maybe I
think your way of using it is not a normal
Never mind, the problem was that Java was not installed on Windows. I installed
Java and the problem went away.
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent:
UDAF support is a WIP, at least from an API user's perspective, as there is no public
API to my knowledge.
https://issues.apache.org/jira/browse/SPARK-3947
Thanks
On Tue, Jan 27, 2015 at 12:26 PM, sunwei hisun...@outlook.com wrote:
Hi, can anyone show me some examples using UDAFs with the Spark SQLContext?