Sorry, I sent the wrong join code snippet; the actual snippet is
aggImpsDf.join(
aggRevenueDf,
aggImpsDf("id_1") <=> aggRevenueDf("id_1")
&& aggImpsDf("id_2") <=> aggRevenueDf("id_2")
&& aggImpsDf("day_hour") <=> aggRevenueDf("day_hour")
&& aggImpsDf("day_hour_2") <=>
Hi All,
I got this weird error when I tried to run Spark in yarn-cluster mode. I have
33 files and I am looping Spark in bash over them one by one; most of them
worked OK except a few files.
Is the error below an HDFS error or a Spark error?
Exception in thread "Driver" java.lang.IllegalArgumentException: Pathname
Check what you have at SimpleMktDataFlow.scala:106
~Pratik
On Fri, Oct 23, 2015 at 11:47 AM kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Full Error:-
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
> at
Oh, no wonder... it undoes the glob (I was reading from /some/path/*),
creates a hadoopRdd for every path, and then creates a union of them using
UnionRDD.
That's not what I want... no need to do a union. AvroInputFormat already has
the ability to handle globs (or multiple paths, comma separated) very
I had faced a similar issue. The actual problem was not in the file name.
We run Spark on YARN. The actual problem was seen in the logs by running
the command:
$ yarn logs -applicationId
Scroll from the beginning to find the actual error.
~Pratik
On Fri, Oct 23, 2015 at 11:40 AM
I need to run a Spark job as a service in my project, so there is a
"ServiceManager" in it, and it uses
SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit Spark jobs.
First, I tried to write a demo, putting only the SparkLauncher code in the
main method and running it with java -jar; it's
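For reference, a minimal sketch of the SparkLauncher usage described above (available since Spark 1.4); the jar path, class name, and master value are placeholders, not taken from the thread:

```scala
import org.apache.spark.launcher.SparkLauncher

object LauncherDemo {
  def main(args: Array[String]): Unit = {
    // All values below are illustrative placeholders.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-job.jar")
      .setMainClass("com.example.MyJob")
      .setMaster("yarn-cluster")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()       // spawns spark-submit as a child process
    process.waitFor() // block until the submission process exits
  }
}
```

launch() returns a plain java.lang.Process, so a long-running "ServiceManager" would typically monitor its exit code and output streams itself.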
Thanks for your advice, Jem. :)
I will increase the partitioning and see if it helps.
Best,
Yifan LI
> On 23 Oct 2015, at 12:48, Jem Tucker wrote:
>
> Hi Yifan,
>
> I think this is a result of Kryo trying to serialize something too large.
> Have you tried to
Hi Sander,
Thank you for your very informative email. From your email, I've learned
quite a bit.
>>>Is the condition determined somehow from the data coming through
streamLogs, and is newData streamLogs again (rather than a whole data
source?)
No, they are two different Streams. I have two
Hi, I ran Spark to write data to HBase, but found a NoSuchMethodError:
15/10/23 18:45:21 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
dn18-formal.i.nease.net): java.lang.NoSuchMethodError:
com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
I found
Hello!
How to adjust the memory settings properly for SparkR with master="local[*]"
in R?
*When running from R -- SparkR doesn't accept memory settings :(*
I use the following commands:
R> library(SparkR)
R> sc <- sparkR.init(master = "local[*]", sparkEnvir =
list(spark.driver.memory =
Hi,
I created 2 workers on the same machine, each with 4 cores and 6GB RAM.
I submitted the first job, and it allocated 2 cores on each of the worker
processes and utilized the full 4 GB RAM for each executor process.
When I submit my second job, it always stays in the WAITING state.
Cheers!!
On Tue, Oct 20,
I have a Spark job that creates 6 million rows in RDDs. I convert the RDD
into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write
it to HDFS.
I am using spark 1.5.1 with YARN.
Here is the snippet:-
RDDList.parallelStream().forEach(mapJavaRDD -> {
if
Hi Bin,
Very likely the RedisClientPool is being closed too quickly before map has
a chance to get to it. One way to verify would be to comment out the .close
line and see what happens. FWIW I saw a similar problem writing to Solr
where I put a commit where you have a close, and noticed that the
You can run 2 threads in the driver, and Spark will FIFO-schedule the 2 jobs on
the same Spark context you created (executors and cores)... The same idea is
used for the Spark SQL Thrift server flow...
For streaming, I think it lets you run only one stream at a time even if you
run them on multiple threads on
Hi Matej,
I'm also using this and I'm seeing the same behavior here; my driver has
only 530MB, which is the default value.
Maybe this is a bug.
2015-10-23 9:43 GMT-02:00 Matej Holec :
> Hello!
>
> How to adjust the memory settings properly for SparkR with
> master="local[*]"
>
This doesn't show the actual error output from Maven. I have a strong
guess that you haven't set MAVEN_OPTS to increase the memory Maven can
use.
On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi wrote:
> Hi,
>
> I can't seem to get a successful maven build. Please see command
do you have JAVA_HOME set to a java 7 jdk?
2015-10-23 7:12 GMT-04:00 emlyn :
> xjlin0 wrote
> > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> > or without Hadoop or home compiled with ant or maven). There was no
> error
> > message in v1.4.x,
Spark asked YARN to let an executor use 7GB of memory, but it used
more, so it was killed. In each case you see that the executor memory plus
overhead equals the YARN allocation requested. What's the issue with
that?
On Fri, Oct 23, 2015 at 6:46 AM, JoneZhang wrote:
> Here is
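A config sketch of the knob usually turned for this situation (property names as of Spark 1.x on YARN; the values are illustrative, not from the thread):

```scala
import org.apache.spark.SparkConf

// YARN container size = executor memory + overhead; in Spark 1.5 the
// overhead defaults to max(384 MB, 10% of executor memory).
val conf = new SparkConf()
  .set("spark.executor.memory", "7g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB, illustrative
```

Raising the overhead gives off-heap allocations (netty buffers, JVM metadata) more headroom before YARN kills the container.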
Both Spark 1.5 and 1.5.1 are released so it certainly shouldn't be a problem
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
If you can reproduce it, then I think you can open a JIRA for this.
Thanks
Best Regards
On Fri, Oct 23, 2015 at 1:37 PM, Eugen Cepoi wrote:
> When fixing the port to the same values as in the stack trace it works
> too. The network config of the slaves seems correct.
>
>
JAVA_HOME is unset.
I've also tried setting it with:
export JAVA_HOME=$(/usr/libexec/java_home)
which sets it to
"/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home" and I
still get the same problem.
On 23 October 2015 at 14:37, Jonathan Coveney wrote:
> do you
Thanks for the advice. In my case it turned out to be two issues.
- use Java rather than Scala to launch the process, putting the core Scala libs
on the class path.
- I needed a merge strategy of Concat for reference.conf files in my build.sbt
Regards,
Mike
> On 23 Oct 2015, at 01:00, Ted Yu
Here is the Spark configuration and error log:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 10
spark.executor.cores 1
spark.executor.memory 6G
Hi,
I can't seem to get a successful maven build. Please see command output
below:
bash-3.2$ ./make-distribution.sh --name spark-latest --tgz --mvn mvn
-Dhadoop.version=2.7.0 -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests
clean package
+++ dirname ./make-distribution.sh
++ cd .
++ pwd
+
Hi there,
we have a set of relatively light weight jobs that we would like to run
repeatedly on our Spark cluster.
The situation is as follows. we have a reliable source of data, a Cassandra
database. One table contains time series data, which we would like to
analyse. To do so we read a window
xjlin0 wrote
> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> or without Hadoop or home compiled with ant or maven). There was no error
> message in v1.4.x, system prompt nothing. On v1.5.x, once I enter
> $SPARK_HOME/bin/pyspark or spark-shell, I got
>
> Error:
Hi,
I have a big sorted RDD, sRdd (~962 million elements), and need to scan its
elements in order (using sRdd.toLocalIterator).
But the process failed after the scan had covered around 893 million
elements, returning the following exception:
Anyone have an idea? Thanks!
Exception in thread
Sandy
The assembly jar does contain org.apache.spark.deploy.yarn.ExecutorLauncher.
I am trying to find out how I can increase the logging level, so I know the
exact classpath used by the YARN ContainerLaunch.
Deenar
On 23 October 2015 at 03:30, Sandy Ryza wrote:
> Hi
Hi Sujit and all,
Currently I am stuck in a great difficulty, and I am eager to get some help
from you.
There is a big linear system of equations Ax = b, where A has N rows and N
columns, N is very large, and b = [0, 0, ..., 0, 1]^T. Then I will solve it
to get x = [x1, x2, ..., xn]^T.
The
Hi Yifan,
I think this is a result of Kryo trying to serialize something too large.
Have you tried to increase your partitioning?
Cheers,
Jem
On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote:
> Hi,
>
> I have a big sorted RDD sRdd(~962million elements), and need to scan
Hi Yifan,
You could also try increasing *spark.kryoserializer.buffer.max.mb*
(64 MB by default): useful if your buffer needs to grow beyond 64 MB.
Per doc:
Maximum allowable size of Kryo serialization buffer. This must be larger
than any object
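A config sketch of that advice (note that in Spark 1.5 the property was renamed to spark.kryoserializer.buffer.max with a size suffix; the older .max.mb form still works but is deprecated; the value is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.max", "256m") // raise from the 64m default
```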
Just try dropping in that jar. Hadoop core ships with an out-of-date Guava JAR
to avoid breaking old code downstream, but 2.7.x is designed to work with later
versions too (i.e. it has moved off all of the now-removed methods). See
https://issues.apache.org/jira/browse/HADOOP-10101 for the
You can do the following. Start the spark-shell. Register the UDFs in the
shell using sqlContext, then start the Thrift Server using startWithContext
from the spark shell:
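The steps above can be sketched as follows, run inside spark-shell (this assumes a Hive-enabled build, where the shell's sqlContext is a HiveContext; the UDF is an illustrative example):

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Register a UDF on the shell's context...
sqlContext.udf.register("myUpper", (s: String) => s.toUpperCase)

// ...then expose that same context over JDBC/ODBC, so Thrift clients
// see the registered UDF.
HiveThriftServer2.startWithContext(sqlContext.asInstanceOf[HiveContext])
```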
Hi Zhiliang,
For a system of equations AX = y, Linear Regression will give you a
best-fit estimate for A (coefficient vector) for a matrix of feature
variables X and corresponding target variable y for a subset of your data.
OTOH, what you are looking for here is to solve for x a system of
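Not from the thread, but for a dense system of modest size one hedged local option is Breeze (which Spark already depends on); for the very large N described here, a distributed approach such as the MLlib SVD mentioned elsewhere in the thread would be needed instead:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Illustrative 2x2 system; the real A is far too large for a local solve.
val A = DenseMatrix((2.0, 1.0), (1.0, 3.0))
val b = DenseVector(0.0, 1.0)
val x = A \ b // LAPACK-backed linear solve of Ax = b
```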
Hello.
I have activated file checkpointing for a DStream in order to enable
updateStateByKey.
My unit test worked with no problem, but when I integrated this into my
full stream I got this exception:
java.io.NotSerializableException: DStream checkpointing has been enabled but
the DStreams
I got this working. For others trying this: it turns out that in Spark 1.3/CDH5.4
spark.yarn.jar=local:/opt/cloudera/parcels/
I had changed this to reflect the 1.5.1 version of the Spark assembly jar:
spark.yarn.jar=/opt/spark-1.5.1-bin/...
and this didn't work; I had to drop the "local:" prefix.
https://github.com/databricks/spark-avro/pull/95
On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote:
> oh no wonder... it undoes the glob (i was reading from /some/path/*),
> creates a hadoopRdd for every path, and then creates a union of them using
> UnionRDD.
>
> thats
Mind sharing your code, if possible ?
Thanks
On Fri, Oct 23, 2015 at 9:49 AM, crakjie wrote:
> Hello.
>
> I have activated the file checkpointing for a DStream to unleach the
> updateStateByKey.
> My unit test worked with no problem but when I have integrated this in my
> full
I noticed in the comments for HadoopFsRelation.buildScan it says:
* @param inputFiles For a non-partitioned relation, it contains paths of
all data files in the
* relation. For a partitioned relation, it contains paths of all
data files in a single
* selected partition.
Do I
If I have two columns
StructType(Seq(
StructField("id", LongType),
StructField("phones", ArrayType(StringType
I want to add an index for “phones” before I explode it.
Can this be implemented as GenericUDF?
I tried DataFrame.explode. It worked for simple types like string, but I
could not
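One hedged way to do this with the Spark 1.5 DataFrame API is to zip the array with its indices inside DataFrame.explode (column names are taken from the question; the rest is illustrative):

```scala
import org.apache.spark.sql.Row

// Produces one row per phone, with new columns "_1" (index) and "_2" (phone).
val exploded = df.explode($"phones") { case Row(phones: Seq[_]) =>
  phones.asInstanceOf[Seq[String]].zipWithIndex.map { case (p, i) => (i, p) }
}
```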
This e-mail and any files transmitted with it are for the sole use of the
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient(s), please reply to the sender and
destroy all copies of the original message. Any unauthorized review,
The user facing type mapping is documented here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang wrote:
> If I have two columns
>
> StructType(Seq(
> StructField("id", LongType),
>
How did you specify number of cores each executor can use?
Be sure to use this when submitting jobs with spark-submit:
*--total-executor-cores
100.*
Other options won't work from my experience.
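The same cap can also be set in code; in standalone mode the flag above corresponds to the spark.cores.max property (the value here is illustrative):

```scala
import org.apache.spark.SparkConf

// Limit this app to 4 cores cluster-wide, leaving cores free for a second job.
val conf = new SparkConf().set("spark.cores.max", "4")
```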
On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma
wrote:
> Hi,
>
> I created
Hi Sujit,
Firstly, I must express my deep appreciation and respect for your kind help
and excellent knowledge. It would be best if you and I were in the same
place; then I would go in person to express my thanks and respect to you.
I will try your way with Spark MLlib SVD.
For Linear
Don't use groupBy; use reduceByKey instead. groupBy should always be
avoided as it leads to a lot of shuffle reads/writes.
On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya
wrote:
> Sorry i sent the wrong join code snippet, the actual snippet is
>
> ggImpsDf.join(
>
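A sketch of the difference being described (word-count style, names illustrative): reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

val counts = pairs.reduceByKey(_ + _)                // combines map-side
// val counts = pairs.groupByKey().mapValues(_.sum)  // shuffles all values
```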
Take a look at first section of https://spark.apache.org/community
On Fri, Oct 23, 2015 at 1:46 PM, wrote:
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
>
I need to save the twitter status I receive so that I can do additional
batch based processing on them in the future. Is it safe to assume HDFS is
the best way to go?
Any idea what is the best way to save twitter status to HDFS?
JavaStreamingContext ssc = new JavaStreamingContext(jsc,
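One hedged option, sketched here in Scala rather than the question's Java: write each micro-batch to HDFS with saveAsTextFiles, which creates one output directory per batch interval (the path prefix is a placeholder):

```scala
import org.apache.spark.streaming.twitter.TwitterUtils

val tweets = TwitterUtils.createStream(ssc, None) // DStream[twitter4j.Status]
tweets.map(_.getText)
  .saveAsTextFiles("hdfs:///data/tweets/batch", "txt")
```

For later batch processing, a compact serialized form (e.g. the raw JSON of each status) may be preferable to plain text.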
I have not been able to run spark-shell in yarn-cluster mode since 1.5.0 due to
the same issue described by [SPARK-9776]. Did this pull request fix the issue?
https://github.com/apache/spark/pull/8947
I still have the same problem with 1.5.1 (I am running on HDP 2.2.6 with Hadoop
2.6)
Thanks.
Hi Shane,
Tachyon provides an api to get the block locations of the file which Spark
uses when scheduling tasks.
Hope this helps,
Calvin
On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane
wrote:
> Hi all,
>
>
>
> I am looking into how Spark handles data locality wrt
I saw this when I tested manually (without ./make-distribution)
Detected Maven Version: 3.2.2 is not in the allowed range 3.3.3.
So I simply upgraded maven to 3.3.3.
Resolved. Thanks
On Fri, Oct 23, 2015 at 3:17 PM, Sean Owen wrote:
> This doesn't show the actual error
I have a Spark job that creates 6 million rows in RDDs. I convert the RDD
into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write
it to HDFS.
Here is the snippet:-
RDDList.parallelStream().forEach(mapJavaRDD -> {
if (mapJavaRDD != null) {
Thanks for the suggestion.
1. Heartbeat:
As a matter of fact, the heartbeat solution is what I thought of as well.
However that needs to be outside spark-streaming.
Furthermore, it cannot be generalized to all Spark applications. For example,
I am doing several filtering operations before I reach
emlyn wrote
>
> xjlin0 wrote
>> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
>> or without Hadoop or home compiled with ant or maven). There was no
>> error message in v1.4.x, system prompt nothing. On v1.5.x, once I enter
>> $SPARK_HOME/bin/pyspark or spark-shell, I
Hi all,
I am looking into how Spark handles data locality wrt Tachyon. My main concern
is how this is coordinated. Will it send a task based on a file loaded from
Tachyon to a node that it knows has that file locally, and how does it know
which nodes have what?
Kind regards,
Shane
This email
Are you sure RedisClientPool is being initialized properly in the
constructor of RedisCache? Can you please copy paste the code that you use
to initialize RedisClientPool inside the constructor of RedisCache?
Thanks,
Aniket
On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote:
>
It's an open issue : https://issues.apache.org/jira/browse/SPARK-4587
That being said, you can work around the issue by serializing the model
(simple Java serialization) and then restoring it before calling the
prediction job.
Best Regards,
On 22/10/2015 14:33, Sebastian Kuepers wrote:
>
I use mapPartitions to open connections to Redis, I write it like this:
val seqs = lines.mapPartitions { lines =>
val cache = new RedisCache(redisUrl, redisPort)
val result = lines.map(line => Parser.parseBody(line, cache))
cache.redisPool.close
result
}
But it
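If the diagnosis elsewhere in the thread is right (the pool is closed before map's lazy iterator ever runs), one hedged fix is to materialize the partition's results before closing:

```scala
val seqs = lines.mapPartitions { iter =>
  val cache = new RedisCache(redisUrl, redisPort)
  // Force evaluation while the pool is still open...
  val result = iter.map(line => Parser.parseBody(line, cache)).toList
  cache.redisPool.close()
  // ...then hand back an already-computed iterator.
  result.iterator
}
```

The trade-off is that each partition is held in memory at once; RedisCache and Parser are the poster's own classes.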
BTW, "lines" is a DStream.
On Fri, Oct 23, 2015 at 2:16 PM, Bin Wang wrote:
> I use mapPartitions to open connections to Redis, I write it like this:
>
> val seqs = lines.mapPartitions { lines =>
> val cache = new RedisCache(redisUrl, redisPort)
> val result = lines.map(line
Most likely a network issue; you need to check your network configuration from
the AWS console and make sure the ports are accessible within the cluster.
Thanks
Best Regards
On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi wrote:
> Huh indeed this worked, thanks. Do you know why
Hi, Sebastian,
To use private APIs, you have to be very familiar with the code path;
otherwise, it is very easy to hit an exception or a bug.
My suggestion is to use IntelliJ to step into the
function hiveContext.sql step by step until you hit the parseSql API. Then
you will know if you have to
Actually the groupBy is not taking a lot of time.
The join that I do later takes the most (95%) of the time.
Also, the grouping I am doing is based on the DataFrame API, which does not
contain any function for reduceBy... I guess the DF automatically uses
reduceBy when we do a groupBy.
Hi, I am having a weird issue. I have a Spark job which has a bunch of
hiveContext.sql() calls and creates ORC files as part of Hive tables with
partitions, and it runs fine in 1.4.1 and Hadoop 2.4.
Now I tried to move to Spark 1.5.1/Hadoop 2.6, and the job does not work as
expected: it does not create ORC
In an RDD map function, is there a way I can know the list of host names where
the map runs? Any code sample would be appreciated.
thx,
Weide
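A hedged sketch of one simple way to see this: each task can read the name of the host it runs on from the JVM with plain java.net, no Spark API required:

```scala
import java.net.InetAddress

val tagged = rdd.map { x =>
  val host = InetAddress.getLocalHost.getHostName // host this task runs on
  (host, x)
}
```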
Can you outline your use case a bit more ?
Do you want to know all the hosts which would run the map ?
Cheers
On Fri, Oct 23, 2015 at 5:16 PM, weoccc wrote:
> in rdd map function, is there a way i can know the list of host names
> where the map runs ? any code sample would
Yeah,
my use case is that I want to have some external communication where the RDD
map is being run. The external communication might be handled separately,
transparent to Spark. What would be the hacky way and the non-hacky way to do
that? :)
Weide
On Fri, Oct 23, 2015 at 5:32 PM, Ted Yu
Hi,
Here's my situation, I have some kind of offline dataset, but I want to
form a virtual data stream feeding to Spark Streaming, my code looks like
this
// sort offline data by time
1) JavaRDD sortedByTime = offlineDataRDD.sortBy( );
// compute a list of JavaRDD, each element
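One hedged way to finish this pattern is queueStream, which replays one queued RDD per batch interval (the original code is Java; this sketch is Scala, and timeSlicedRdds stands in for the list of per-window RDDs computed above):

```scala
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val queue = new Queue[RDD[String]]()
timeSlicedRdds.foreach(queue.enqueue(_)) // one RDD per simulated time slice
val virtualStream = ssc.queueStream(queue)
```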
When I use that I get a "Caused by: org.postgresql.util.PSQLException:
ERROR: column "none" does not exist"
On Thu, Oct 22, 2015 at 9:31 PM, Kayode Odeyemi wrote:
> Hi,
>
> I've trying to load a postgres table using the following expression:
>
> val cachedIndex =
When fixing the port to the same values as in the stack trace it works too.
The network config of the slaves seems correct.
Thanks,
Eugen
2015-10-23 8:30 GMT+02:00 Akhil Das :
> Mostly a network issue, you need to check your network configuration from
> the aws
There is a heartbeat stream pattern that you can use: Create a service
(perhaps a thread in your driver) that pushes a heartbeat event to a
different stream every N seconds. Consume that stream as well in your
streaming application, and perform an action on every heartbeat.
This has worked well
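A sketch of the heartbeat pattern described above (interval and payload are illustrative):

```scala
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val beats = new Queue[RDD[Long]]()
val heartbeats = ssc.queueStream(beats)

// Driver-side thread pushing a tick every 5 seconds.
new Thread(new Runnable {
  def run(): Unit = while (true) {
    beats.enqueue(ssc.sparkContext.makeRDD(Seq(System.currentTimeMillis)))
    Thread.sleep(5000)
  }
}).start()

heartbeats.foreachRDD { rdd =>
  if (rdd.count() > 0) { /* act on every heartbeat */ }
}
```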
Have a look at this: https://github.com/koeninger/kafka-exactly-once
especially:
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala
Hello,
Data about my spark job is below. My source data is only 916MB (stage 0)
and 231MB (stage 1), but when I join the two data sets (stage 2) it takes a
very long time, and as I can see, the shuffled data is 614GB. Is this
expected? Both data sets produce 200 partitions.
Stage
Hi all,
I am running a simple word count job on a cluster of 4 nodes (24 cores per
node). I am varying two parameters in the configuration,
spark.python.worker.memory and the number of partitions in the RDD. My job
is written in python.
I am observing a discontinuity in the run time of the job
You might be referring to some class-level variables from your code.
I got to see the actual field which caused the error when I marked the
class as serializable and ran it on the cluster.
class MyClass extends java.io.Serializable
The following resources will also help:
Full Error:-
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:104)
at