I closed the Spark Shell and tried but no change.
Here is the error:
15/01/17 14:33:39 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:59791
15/01/17 14:33:39 INFO Server: jetty-8.y.z-SNAPSHOT
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED
Hi,
I am running a Spark job. I get the output correctly but when I see the
logs file I see the following:
AbstractLifeCycle: FAILED.: java.net.BindException: Address already in
use...
What could be the reason for this?
Thank You
The spark.sql.parquet.filterPushdown=true flag has been turned on. But when
spark.sql.hive.convertMetastoreParquet is set to false, the first parameter
stops taking effect!!!
2015-01-20 6:52 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:
If you're talking about filter pushdowns for parquet files
I had the Spark Shell running throughout. Is it because of that?
On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote:
Was there another instance of Spark running on the same machine ?
Can you pastebin the full stack trace ?
Cheers
On Mon, Jan 19, 2015 at 8:11 PM, Deep
Yes, I have increased the driver memory in spark-default.conf to 2g. Still
the error persists.
On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you seen these threads ?
http://search-hadoop.com/m/JW1q5tMFlb
http://search-hadoop.com/m/JW1q5dabji1
Cheers
On Mon, Jan
I just checked the post. Do you still need help?
I think getAs(Seq[String]) should help.
If you are still stuck let me know.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html
Sent from
Hi,
On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng pc...@uow.edu.au wrote:
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
                    |       |
                    V       V
Yes, actually that is exactly what I mean. And maybe you missed my last
response: you can use the API
jsonRDD(json: RDD[String], schema: StructType)
to explicitly specify your schema. For numbers bigger than Long, we can use
DecimalType.
Thanks,
Daoyuan
From: Tobias Pfeiffer
I want to compute an RDD[(String, Set[String])] that includes some very large
Set[String] values.
--
val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // creates large Sets (shuffle read size 7GB)
val counted = reduced.map { case (key, strSeq) =>
Sean,
A related question: when should one persist the RDD, after step 2 or after
step 3? (Nothing would happen before step 3, I assume.)
On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen so...@cloudera.com wrote:
From the OP:
(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Filter out
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you
call counted.collect?
If you still face the same issue please paste the stacktrace here.
--
View this message in context:
Have you seen these threads ?
http://search-hadoop.com/m/JW1q5tMFlb
http://search-hadoop.com/m/JW1q5dabji1
Cheers
On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi Ted,
When I run the same job with small data, it works. But when
I run it with
Hi,
On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Driver programs submitted by the spark-submit script will get the runtime
Spark master URL, but how does the program get the URL inside the main method
when creating the SparkConf object?
The master will be stored in the
As far as I know, the steps before calling saveAsTextFile are transformations,
so they are lazily computed. Then the saveAsTextFile action performs all the
transformations, and your Set[String] grows at that point. It creates a large
collection if you have few keys, and this easily causes OOM when your
executor
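The laziness described above can be sketched with plain Scala, using a LazyList standing in for an RDD lineage (a sketch of the semantics only, not Spark itself):

```scala
// Transformations build a lazy lineage; nothing executes until an action
// forces it. LazyList stands in for an RDD here (Scala 2.13 standard library).
var evaluated = 0
val lineage = LazyList.from(1).take(3).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)      // nothing computed yet, like RDD transformations
val result = lineage.toList // forcing it plays the role of an action
assert(evaluated == 3 && result == List(2, 4, 6))
```

This is why the large Sets only materialize (and can OOM) at the action, not when reduceByKey is declared.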
Yes we tried the master branch (sometime last week) and there was no issue,
but the above repro is for branch 1.2 and Hive 0.13. Isn't that the final
release branch for Spark 1.2?
If so, a patch needs to be created or back-ported from master?
(Yes the obvious typo in the column name was
Hi,
On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
The second parameter of jsonRDD is the sampling ratio when we infer schema.
OK, I was aware of this, but I guess I understand the problem now. My
sampling ratio is so low that I only see the Long values of data
Hi,
I am using Spark on AWS and want to write the output to S3. It is a
relatively small file and I don't want them to output as multiple parts. So
I use
result.repartition(1).saveAsTextFile("s3://...")
However as long as I am using the saveAsTextFile method, the output doesn't
keep the original
If you're talking about filter pushdowns for parquet files this also has to
be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's
off by default
On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote:
Yes it works!
But the filter can't be pushed down!!!
If
Hi,
I am trying to aggregate a key based on some timestamp, and I believe that
spilling to disk is changing the order of the data fed into the combiner.
I have some timeseries data that is of the form: (key, date, other
data)
Partition 1
(A, 2, ...)
(B, 4, ...)
(A, 1, ...)
When you repartition, ordering can get lost. You would need to sort after
repartitioning.
Aniket
On Tue, Jan 20, 2015, 7:08 AM anny9699 anny9...@gmail.com wrote:
Hi,
I am using Spark on AWS and want to write the output to S3. It is a
relatively small file and I don't want them to output as
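The point about ordering can be sketched with plain Scala collections: group by key, then re-sort by timestamp before combining (names here are illustrative, not Spark API):

```scala
// After a shuffle, per-key arrival order is arbitrary; sort each key's
// records by timestamp before feeding them to a combiner.
case class Rec(key: String, ts: Int)
val partition = List(Rec("A", 2), Rec("B", 4), Rec("A", 1)) // arbitrary order
val sortedByKey: Map[String, List[Int]] =
  partition.groupBy(_.key).map { case (k, recs) => k -> recs.map(_.ts).sorted }
println(sortedByKey("A")) // List(1, 2)
```

In Spark itself the analogous move is to sort within each key's group after the shuffle rather than relying on input order.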
Was there another instance of Spark running on the same machine ?
Can you pastebin the full stack trace ?
Cheers
On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I am running a Spark job. I get the output correctly but when I see the
logs file I see the
Hi Jon, I am looking for an answer to a similar question in the docs now; so
far no clue.
I would need to know what Spark's behaviour is in a situation like the example
you provided, but taking into account also that there are multiple
partitions/workers.
I could imagine it's possible that
Hi all,
I'm trying to implement a pipeline for computer vision based on the latest
ML package in spark. The first step of my pipeline is to decode image (jpeg
for instance) stored in a parquet file.
For this, I begin by creating a UserDefinedType that represents a decoded
image stored in an array of
Hello Mixtou, if you want to look at partition ID, I believe you want to use
mapPartitionsWithIndex
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-Question-on-How-Tasks-are-Executed-tp21064p21228.html
Sent from the Apache Spark User List mailing
Yes it works!
But the filter can't be pushed down!!!
What if a custom ParquetInputFormat only implements the data source API?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
2015-01-16 21:51 GMT+08:00 Xiaoyu Wang wangxy...@gmail.com:
Thanks
In your code, you're combining large sets, like
(set1 ++ set2).size
which is not a good idea.
(rdd1 ++ rdd2).distinct
is an equivalent implementation and will be computed in a distributed manner.
I'm not sure whether your computations on keyed sets can all be transformed
into RDD operations.
Regards,
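The equivalence claimed above can be checked on plain collections (Lists stand in for the two RDDs here; this shows the semantics, not the distributed execution):

```scala
// (set1 ++ set2).size on the driver gives the same number that
// (rdd1 ++ rdd2).distinct.count would give when computed distributedly.
val set1 = Set(1, 2, 3)
val set2 = Set(3, 4, 5)
val viaSets = (set1 ++ set2).size
val viaUnionDistinct = (set1.toList ++ set2.toList).distinct.size
assert(viaSets == viaUnionDistinct) // both are 5
```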
Hi all!
I am trying to use Kinesis and Spark Streaming together. When I execute the
program I get the exception com.amazonaws.AmazonClientException: Unable to load
AWS credentials from any provider in the chain
Here is my piece of code
val credentials = new
Try this piece of code:
System.setProperty("AWS_ACCESS_KEY_ID", access_key)
System.setProperty("AWS_SECRET_KEY", secret)
val streamName = "mystream"
val endpointUrl = "https://kinesis.us-east-1.amazonaws.com/"
val kinesisClient = new AmazonKinesisClient(new
That is what I want to do: get a unique count for each key. So map() or
countByKey() won't do, because they don't give a unique count (duplicate
strings would be counted)...
--
View this message in context:
If there are not too many keys, you can do it like this:
val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()
==
data:
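The same per-key distinct counts can be checked locally with plain collections (this only demonstrates the semantics, not the distributed execution):

```scala
// groupBy + set union + size gives the per-key distinct counts that the
// persist/filter/distinct.count pattern computes on an RDD.
val data = List(
  ("A", Set(1, 2, 3)),
  ("A", Set(1, 2, 4)),
  ("B", Set(1, 2, 3))
)
val counts: Map[String, Int] =
  data.groupBy(_._1).map { case (k, pairs) =>
    k -> pairs.map(_._2).reduce(_ ++ _).size
  }
println(counts("A")) // 4
println(counts("B")) // 3
```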
Deep, yes, you have another Spark shell or application sticking around
somewhere. Try to inspect the running processes, look out for a java process,
and kill it.
This might be helpful:
https://www.digitalocean.com/community/tutorials/how-to-use-ps-kill-and-nice-to-manage-processes-in-linux
Also, That
I don't think this has anything to do with transferring anything from
the driver, or per task. I'm talking about a singleton object in the
JVM that loads whatever you want from wherever you want and holds it
in memory once per JVM. That is, I do not think you have to use
broadcast, or even any
I have recently encountered a similar problem with Guava version collision
with Hadoop.
Isn't it more correct to upgrade Hadoop to use the latest Guava? Why are
they staying on version 11, does anyone know?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Wed, Jan 7, 2015 at 7:59
-- Forwarded message --
From: Rapelly Kartheek kartheek.m...@gmail.com
Date: Mon, Jan 19, 2015 at 3:03 PM
Subject: UnknownhostException : home
To: user@spark.apache.org user@spark.apache.org
Hi,
I get the following exception when I run my application:
Actually there is already someone on Hadoop-Common-Dev taking care of
removing the old Guava dependency
http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201501.mbox/browser
https://issues.apache.org/jira/browse/HADOOP-11470
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
Please see this thread:
http://search-hadoop.com/m/LgpTk2aVYgr/Hadoop+guava+upgradesubj=Re+Time+to+address+the+Guava+version+problem
On Jan 19, 2015, at 6:03 AM, Romi Kuntsman r...@totango.com wrote:
I have recently encountered a similar problem with Guava version collision
with Hadoop.
Hello
I use Hive on Spark and have an issue with assigning several aliases to the
output (several return values) of an UDF. I ran in several issues and ended
up with a workaround (described at the end of this message).
- Is assigning several aliases to the output of an UDF not supported by
Is it possible to use a HashPartitioner or something similar to distribute a
SchemaRDD's data by the hash of a particular column or set of columns?
Having done this I would then hope that GROUP BY could avoid a shuffle.
E.g. set up a HashPartitioner on the CustomerCode field so that
SELECT CustomerCode,
:44.119282 19317 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0119 02:41:44.123064 19317 sched.cpp:408] Framework registered with
20150119-003609-201523392-5050-7198-0002
15/01/19 02:41:44 INFO MesosSchedulerBackend: Registered as framework ID
20150119-003609
Hi,
I created some unit tests to test some of the functions in my project which
use Spark. However, when I used the sbt tool to build it and then ran the
sbt test, I ran into java.io.IOException: Could not create FileClient:
2015-01-19 08:50:38,1894 ERROR Client
I think it's always computed twice. Could you provide a demo case where
RDD1 is calculated only once?
On Sat, Jan 17, 2015 at 2:37 AM, Peng Cheng pc...@uow.edu.au wrote:
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
it's not able to resolve home to an IP.
Assuming it's your local machine, add an entry like the following to your
/etc/hosts file and then run the program again (use sudo to edit the file):
127.0.0.1 home
On Mon, Jan 19, 2015 at 3:03 PM, Rapelly Kartheek
kartheek.m...@gmail.com wrote:
Hi,
I get the
Sorry, to be clear, you need to write hdfs:///home/ Note three
slashes; there is an empty host between the 2nd and 3rd. This is true
of most URI schemes with a host.
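The two-versus-three-slash difference is easy to verify with java.net.URI:

```scala
import java.net.URI

// Two slashes: "home" is parsed as the authority (a hostname).
val twoSlashes = new URI("hdfs://home/user/data")
assert(twoSlashes.getHost == "home")
assert(twoSlashes.getPath == "/user/data")

// Three slashes: empty authority, so /home is just the root directory.
val threeSlashes = new URI("hdfs:///home/user/data")
assert(threeSlashes.getHost == null)
assert(threeSlashes.getPath == "/home/user/data")
```

This is exactly why hdfs://home/... makes Hadoop try to resolve "home" as a host.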
On Mon, Jan 19, 2015 at 9:56 AM, Rapelly Kartheek
kartheek.m...@gmail.com wrote:
Yes yes.. hadoop/etc/hadoop/hdfs-site.xml
+1 to Sean suggestion
On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen so...@cloudera.com wrote:
I bet somewhere you have a path like hdfs://home/... which would
suggest that 'home' is a hostname, when I imagine you mean it as a
root directory.
On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek
I am quoting the reply I got on this - which for some reason did not get
posted here. The suggestion in the reply below worked perfectly for me. The
error mentioned in the reply is not related (or old).
Hope this is helpful to someone.
Cheers,
BB
Hi, BB
Ideally you can do the query like:
Where can I find good documentation on Spark SQL's Catalyst?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/is-there-documentation-on-spark-sql-catalyst-tp21232.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Yes yes.. hadoop/etc/hadoop/hdfs-site.xml file has the path like:
hdfs://home/...
On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen so...@cloudera.com wrote:
I bet somewhere you have a path like hdfs://home/... which would
suggest that 'home' is a hostname, when I imagine you mean it as a
root
The problem we're seeing is a gradual memory leak in the driver's JVM.
We are executing jobs using a long-running Java app which creates relatively
short-lived SparkContexts, so our Spark drivers are created as part of
this application's JVM. We're using standalone cluster mode, Spark 1.0.2.
Root cause of
Is there any way to support multiple users executing SQL on one Thrift
server?
I think there are some problems with Spark 1.2.0, for example:
1. Start the Thrift server as user A
2. Connect to the Thrift server via beeline as user B
3. Execute "insert into table dest select ... from table src"
then
The problem is clearly to do with the executor exceeding YARN
allocations, so, this can't be in local mode. He said this was running
on YARN at the outset.
On Mon, Jan 19, 2015 at 2:27 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
If you are running spark in local mode, executor
Added a JIRA to track
https://issues.apache.org/jira/browse/SPARK-5309
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21229.html
Sent from the Apache Spark User List mailing list archive at
Hi,
I get the following exception when I run my application:
karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class
org.apache.spark.examples.SimpleApp001 --deploy-mode client --master
spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar
out1.txt
log4j:WARN No such
Hi,
On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Can you share more information about how do you do that? I also have
similar question about this.
Not very proud of it ;-), but here you go:
// find the number of workers available to us.
val _runCmd =
I bet somewhere you have a path like hdfs://home/... which would
suggest that 'home' is a hostname, when I imagine you mean it as a
root directory.
On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek
kartheek.m...@gmail.com wrote:
Hi,
I get the following exception when I run my application:
Actually, I don't have any entry in my /etc/hosts file with the hostname
home. In fact, I didn't use this hostname anywhere. Then why is it trying
to resolve it?
On Mon, Jan 19, 2015 at 3:15 PM, Ashish paliwalash...@gmail.com wrote:
it's not able to resolve home to an IP.
Assuming it's
Yeah... I made that mistake in spark/conf/spark-defaults.conf for
setting: spark.eventLog.dir.
Now it works
Thank you
Karthik
On Mon, Jan 19, 2015 at 3:29 PM, Sean Owen so...@cloudera.com wrote:
Sorry, to be clear, you need to write hdfs:///home/ Note three
slashes; there is an
+1, I too need to know.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
From the OP:
(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Filter out all rows from lines that are not of type A or B
(3) val processA = Process only the A rows from ABonly
(4) val processB = Process only the B rows from ABonly
I assume that 3 and 4 are actions, or else
For anyone who finds this later, looks like Jerry already took care of this
here: https://issues.apache.org/jira/browse/SPARK-5315
Thanks!
On Sun, Jan 18, 2015 at 10:28 PM, Shao, Saisai saisai.s...@intel.com
wrote:
Hi Jeff,
From my understanding it seems more like a bug, since
Your classpath has some MapR jar.
Is that intentional ?
Cheers
On Mon, Jan 19, 2015 at 6:58 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
I created some unit tests to test some of the functions in my project
which use Spark. However, when I used the sbt tool to build it and then ran
Hi John and David,
I tried this to run them concurrently:
List(RDD1, RDD2, ...).par.foreach { rdd =>
  rdd.collect().foreach(println)
}
This was able to successfully register the tasks, but the parallelism of the
stages is limited: it was able to run 4 of them at times, and only one of
them at other times.
Keep in mind that your executors will be able to run some fixed number
of tasks in parallel, given your configuration. You should not
necessarily expect that arbitrarily many RDDs and tasks would schedule
simultaneously.
On Mon, Jan 19, 2015 at 5:34 PM, critikaled isasmani@gmail.com wrote:
I also found this issue. I have reported it as a bug
https://issues.apache.org/jira/browse/SPARK-5242 and submitted a fix. You
can find a link to the fixed fork in the comments on the issue page. Please
vote on the issue; hopefully the guys will accept the pull request faster
then :)
Regards, Vladimir
On Mon,
Hi guys,
Does this issue affect 1.2.0 only or all previous releases as well?
Best Regards,
Jerry
On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao xuelincao2...@gmail.com wrote:
Yes, the problem is, I've turned the flag on.
One possible reason for this is, the parquet file supports predicate
Hi,
I've setup my first spark cluster (1 master, 2 workers) and an iPython
notebook server that I'm trying to setup to access the cluster. I'm running
the workers from Anaconda to make sure the python setup is correct on each
box. The iPy notebook server appears to have everything setup
If you pass spark master URL to spark-submit, you don't need to pass the
same to SparkConf object. You can create SparkConf without this property or
for that matter any other property that you pass in spark-submit.
On Sun, Jan 18, 2015 at 7:38 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
I am trying to run the SparkR shell on AWS.
I am unable to access the worker nodes' web UI.
15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 1 (already
removed): remote Akka
Have you seen this thread ?
http://search-hadoop.com/m/JW1q5PgA7X
What Spark release are you running ?
Cheers
On Mon, Jan 19, 2015 at 12:04 PM, suresh lanki.sur...@gmail.com wrote:
I am trying to run the SparkR shell on AWS.
I am unable to access the worker nodes' web UI.
15/01/19 19:57:17
Hi Yu,
I am able to run the Spark examples, but I am unable to run the SparkR
examples (only the Pi example runs on SparkR).
Thank you
Regards
Suresh
On Mon, Jan 19, 2015 at 3:08 PM, Ted Yu yuzhih...@gmail.com wrote:
Have you seen this thread ?
http://search-hadoop.com/m/JW1q5PgA7X
What Spark