Try this piece of code:
System.setProperty("AWS_ACCESS_KEY_ID", "access_key")
System.setProperty("AWS_SECRET_KEY", "secret") val streamName =
"mystream" val endpointUrl =
"https://kinesis.us-east-1.amazonaws.com/"; val kinesisClient =
new AmazonKinesisClient(new DefaultAWSCred
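For reference, a minimal sketch that hands the credentials to the client explicitly instead of relying on system properties (the default chain reads the aws.accessKeyId/aws.secretKey system properties and the AWS_ACCESS_KEY_ID/AWS_SECRET_KEY environment variables, not system properties with the env-var names); class names are from the AWS Java SDK v1:

import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.kinesis.AmazonKinesisClient

// explicit credentials bypass the default provider chain entirely
val credentials = new BasicAWSCredentials("access_key", "secret")
val kinesisClient = new AmazonKinesisClient(credentials)
kinesisClient.setEndpoint("https://kinesis.us-east-1.amazonaws.com/")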
Hi all!
I am trying to use kinesis and spark streaming together. So when I execute
program I get exception com.amazonaws.AmazonClientException: Unable to load
AWS credentials from any provider in the chain
Here is my piece of code
val credentials = new
BasicAWSCredentials(KinesisProperties.AWS_
Deep, yes, you have another Spark shell or application sticking around
somewhere. Inspect the running processes, look out for the Java process,
and kill it.
This might be helpful
https://www.digitalocean.com/community/tutorials/how-to-use-ps-kill-and-nice-to-manage-processes-in-linux
Also, That
If keys are not too many,
You can do like this:
val data = List(
("A", Set(1,2,3)),
("A", Set(1,2,4)),
("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()
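If there are many keys, a sketch of doing it in one distributed pass instead of one job per key (assuming exact distinct counts are wanted):

val perKeyDistinct = rdd
  .flatMap { case (key, set) => set.map(v => (key, v)) }  // explode Sets into (key, value) pairs
  .distinct()                                             // distributed dedupe
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)                                     // count distinct values per key
perKeyDistinct.collect()  // Array((A,4), (B,3)) for the data above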
That is what I want to do: get the unique count for each key. Plain map() or
countByKey() doesn't give a unique count, because duplicate strings are
likely to be counted...
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-la
In your code, you're combining large sets, like
(set1 ++ set2).size
which is not a good idea.
(rdd1 ++ rdd2).distinct
is an equivalent implementation and will compute in a distributed manner.
I'm not sure whether your computation on keyed sets is feasible to transform
into RDDs.
Regards,
Kev
I just checked the post. Do you still need help?
I think getAs[Seq[String]] should help.
If you are still stuck, let me know.
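A sketch of what that could look like, assuming a Spark 1.2-era Row with getAs[T] by ordinal (the index i and the occurrence-counting step are invented for illustration):

// pull the nested array out of the row, then count occurrences per element
val nested: Seq[String] = row.getAs[Seq[String]](i)
val counts = nested.groupBy(identity).mapValues(_.size)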
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html
Sent from t
Yes, I have increased the driver memory in spark-defaults.conf to 2g. Still
the error persists.
On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu wrote:
> Have you seen these threads ?
>
> http://search-hadoop.com/m/JW1q5tMFlb
> http://search-hadoop.com/m/JW1q5dabji1
>
> Cheers
>
> On Mon, Jan 19, 2015 at
As far as I know, the operations before calling saveAsTextFile are
transformations, so they are lazily computed. The saveAsTextFile action then
triggers all the transformations, and your Set[String] grows at that point.
It creates a large collection if you have few keys, and this easily causes an
OOM when your executor m
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you
call counted.collect?
If you still face the same issue please paste the stacktrace here.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-inclu
Have you seen these threads ?
http://search-hadoop.com/m/JW1q5tMFlb
http://search-hadoop.com/m/JW1q5dabji1
Cheers
On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan
wrote:
> Hi Ted,
> When I am running the same job with small data, I am able to run. But when
> I run it with relatively bigger set of
I closed the Spark Shell and tried but no change.
Here is the error:
...
15/01/17 14:33:39 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:59791
15/01/17 14:33:39 INFO Server: jetty-8.y.z-SNAPSHOT
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:40
I had the Spark Shell running through out. Is it because of that?
On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu wrote:
> Was there another instance of Spark running on the same machine ?
>
> Can you pastebin the full stack trace ?
>
> Cheers
>
> On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan
> wrote:
Was there another instance of Spark running on the same machine ?
Can you pastebin the full stack trace ?
Cheers
On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan
wrote:
> Hi,
> I am running a Spark job. I get the output correctly but when I see the
> logs file I see the following:
> AbstractLifeC
Hi,
I am running a Spark job. I get the output correctly but when I see the
logs file I see the following:
AbstractLifeCycle: FAILED.: java.net.BindException: Address already in
use...
What could be the reason for this?
Thank You
I want to compute an RDD[(String, Set[String])] where some of the
Set[String] values are large.
--
val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // <= creates large Sets (shuffle read size 7GB)
val counted = reduced.map { case (key, strSeq) => s"$key\t${
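A sketch of a variant that never materializes the merged Sets at all, using Spark's approximate distinct counter (the 0.05 relative accuracy is an arbitrary choice, tune as needed):

val counted = hoge
  .flatMapValues(identity)              // explode each Set into (key, value) pairs
  .countApproxDistinctByKey(0.05)       // RDD[(String, Long)], no giant Sets built
  .map { case (key, n) => s"$key\t$n" }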
Sean,
A related question: when should the RDD be persisted, after step 2 or after
step 3? (Nothing would happen before step 3, I assume.)
On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen wrote:
> From the OP:
>
> (1) val lines = Import full dataset using sc.textFile
> (2) val ABonly = Filter out all rows from "li
When you repartition, ordering can get lost. You would need to sort after
repartitioning.
Aniket
On Tue, Jan 20, 2015, 7:08 AM anny9699 wrote:
> Hi,
>
> I am using Spark on AWS and want to write the output to S3. It is a
> relatively small file and I don't want them to output as multiple parts.
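A sketch combining the two, assuming the rows can be ordered by some key function keyOf (a stand-in name): sortBy with numPartitions = 1 both restores the order and yields a single part file.

result
  .sortBy(keyOf, ascending = true, numPartitions = 1)  // sort and collapse to one partition
  .saveAsTextFile("s3://bucket/path")                  // path is illustrative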
Yes, actually that is exactly what I mean. And maybe you missed my last
response: you can use the API
jsonRDD(json: RDD[String], schema: StructType)
to specify your schema explicitly. For numbers bigger than Long, we can use
DecimalType.
Thanks,
Daoyuan
From: Tobias Pfeiffer [mailto:t...@preferred
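A sketch of that call, assuming Spark 1.2-era imports from org.apache.spark.sql; the field names and the jsonLines RDD are invented:

import org.apache.spark.sql._

// explicit schema: no sampling-based inference, so numeric widths are fixed up front
val schema = StructType(Seq(
  StructField("id", DecimalType.Unlimited, nullable = false),
  StructField("name", StringType, nullable = true)
))
val table = sqlContext.jsonRDD(jsonLines, schema)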
Hi,
I am using Spark on AWS and want to write the output to S3. It is a
relatively small file and I don't want it to be written as multiple part
files. So I use
result.repartition(1).saveAsTextFile("s3://...")
However as long as I am using the saveAsTextFile method, the output doesn't
keep the original
Yes we tried the master branch (sometime last week) and there was no issue,
but the above repro is for branch 1.2 and Hive 0.13. Isn't that the final
release branch for Spark 1.2?
If so, does a patch need to be created or back-ported from master?
(Yes the obvious typo in the column name was introduce
Hi,
On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan
wrote:
> The second parameter of jsonRDD is the sampling ratio when we infer schema.
>
OK, I was aware of this, but I guess I understand the problem now. My
sampling ratio is so low that I only see the Long values of data items and
infer it's a
spark.sql.parquet.filterPushdown=true has been turned on. But with
spark.sql.hive.convertMetastoreParquet set to false, the first parameter
has no effect!!!
2015-01-20 6:52 GMT+08:00 Yana Kadiyska :
> If you're talking about filter pushdowns for parquet files this also has
> to be tur
Hi,
On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng wrote:
> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>                     |       |
>                     V       V
>
Hi,
On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 wrote:
>
> Driver programs submitted by the spark-submit script will get the runtime
> spark master URL, but how it get the URL inside the main method when
> creating the SparkConf object?
>
The master will be stored in the spark.master property
If you're talking about filter pushdowns for parquet files, this also has to
be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's
off by default.
On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wrote:
> Yes it works!
> But the filter can't pushdown!!!
>
> If custom parquetin
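A sketch of setting both from code; the assumption is they must agree for pushdown to kick in (with convertMetastoreParquet=false the Hive SerDe path is taken and the parquet flag has no effect):

// both settings cooperate: pushdown only applies on the native parquet path
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")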
Hi,
I am trying to aggregate a key based on some timestamp, and I believe that
spilling to disk is changing the order of the data fed into the combiner.
I have some timeseries data that is of the form: ("key", "date", "other
data")
Partition 1
("A", 2, ...)
("B", 4, ...)
("A", 1,
Hi Yu,
I am able to run the Spark examples, but I am unable to run the SparkR
examples (only the Pi example runs on SparkR).
Thank you
Regards
Suresh
On Mon, Jan 19, 2015 at 3:08 PM, Ted Yu wrote:
> Have you seen this thread ?
> http://search-hadoop.com/m/JW1q5PgA7X
>
> What Spark release are you runn
I am running Spark 1.2.0 version
On Mon, Jan 19, 2015 at 3:08 PM, Ted Yu wrote:
> Have you seen this thread ?
> http://search-hadoop.com/m/JW1q5PgA7X
>
> What Spark release are you running ?
>
> Cheers
>
> On Mon, Jan 19, 2015 at 12:04 PM, suresh wrote:
>
>> I am trying to run SparkR shell on a
Have you seen this thread ?
http://search-hadoop.com/m/JW1q5PgA7X
What Spark release are you running ?
Cheers
On Mon, Jan 19, 2015 at 12:04 PM, suresh wrote:
> I am trying to run SparkR shell on aws
>
> I am unable to access worker nodes webUI access.
>
> 15/01/19 19:57:17 ERROR TaskSchedulerI
I am trying to run the SparkR shell on AWS.
I am unable to access the worker nodes' web UI.
15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 1 (already
removed): remote Akka cl
Hi guys,
Does this issue affect 1.2.0 only or all previous releases as well?
Best Regards,
Jerry
On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao wrote:
>
> Yes, the problem is, I've turned the flag on.
>
> One possible reason for this is, the parquet file supports "predicate
> pushdown" by settin
I also found this issue. I have reported it as a bug
(https://issues.apache.org/jira/browse/SPARK-5242) and submitted a fix. You
can find a link to the fixed fork in the comments on the issue page. Please
vote for the issue; hopefully the guys will accept the pull request faster then :)
Regards, Vladimir
On Mon,
Hi,
I've set up my first Spark cluster (1 master, 2 workers) and an IPython
notebook server that I'm trying to configure to access the cluster. I'm running
the workers from Anaconda to make sure the Python setup is correct on each
box. The IPython notebook server appears to have everything set up correctly,
If you pass spark master URL to spark-submit, you don't need to pass the
same to SparkConf object. You can create SparkConf without this property or
for that matter any other property that you pass in spark-submit.
On Sun, Jan 18, 2015 at 7:38 AM, guxiaobo1982 wrote:
> Hi,
>
> Driver programs su
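A sketch of what that leaves in the code (spark-submit's --master argument supplies spark.master at launch; the app name is invented):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app")  // no setMaster here
val sc = new SparkContext(conf)                  // master comes from spark-submit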
Keep in mind that your executors will be able to run some fixed number
of tasks in parallel, given your configuration. You should not
necessarily expect that arbitrarily many RDDs and tasks would schedule
simultaneously.
On Mon, Jan 19, 2015 at 5:34 PM, critikaled wrote:
> Hi, john and david
> I
Your classpath has some MapR jar.
Is that intentional ?
Cheers
On Mon, Jan 19, 2015 at 6:58 AM, Jianguo Li
wrote:
> Hi,
>
> I created some unit tests to test some of the functions in my project
> which use Spark. However, when I used the sbt tool to build it and then ran
> the "sbt test", I ra
Hi John and David,
I tried this to run them concurrently:
List(RDD1, RDD2, ...).par.foreach { rdd =>
  rdd.collect().foreach(println)
}
This was able to successfully register the tasks, but the parallelism of the
stages is limited: it was able to run 4 of them at some times and only one of
them at other times, whic
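A sketch of the same idea with Futures instead of .par, so submission isn't tied to the parallel-collection pool (rdd1 and rdd2 stand in for the RDDs above; stage-level parallelism is still capped by the executor cores available to the app):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// each Future submits a job from its own thread
val pending = Seq(rdd1, rdd2).map(rdd => Future(rdd.collect()))
val results = pending.map(Await.result(_, Duration.Inf))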
For anyone who finds this later, looks like Jerry already took care of this
here: https://issues.apache.org/jira/browse/SPARK-5315
Thanks!
On Sun, Jan 18, 2015 at 10:28 PM, Shao, Saisai
wrote:
> Hi Jeff,
>
>
>
> From my understanding it seems more like a bug, since JavaDStreamLike is
> used f
Hello
I use Hive on Spark and have an issue with assigning several aliases to the
output (several return values) of a UDF. I ran into several issues and ended
up with a workaround (described at the end of this message).
- Is assigning several aliases to the output of a UDF not supported by
Sp
Is it possible to use a HashPartitioner or something similar to distribute a
SchemaRDD's data by the hash of a particular column or set of columns?
Having done this, I would then hope that GROUP BY could avoid a shuffle.
E.g. set up a HashPartitioner on the CustomerCode field so that
SELECT CustomerCode, SU
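A sketch at the plain-RDD level, since partitionBy lives on pair RDDs rather than on SchemaRDD itself (the rows RDD and its (CustomerCode, amount) shape are invented):

import org.apache.spark.HashPartitioner

// hash-partition once by CustomerCode; later per-key aggregation reuses it
val byCustomer = rows.partitionBy(new HashPartitioner(64))  // rows: RDD[(String, Long)]
val totals = byCustomer.reduceByKey(_ + _)                  // no extra shuffle: already co-partitioned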
I think it's always computed twice. Could you provide a demo case where
RDD1 is calculated only once?
On Sat, Jan 17, 2015 at 2:37 AM, Peng Cheng wrote:
> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>
Hi,
I created some unit tests to test some of the functions in my project which
use Spark. However, when I used the sbt tool to build it and then ran the
"sbt test", I ran into "java.io.IOException: Could not create FileClient":
2015-01-19 08:50:38,1894 ERROR Client fs/client/fileclient/cc/client
I0119 02:41:44.118728 19317 sched.cpp:234] New master detected at
master@192.0.3.12:5050
I0119 02:41:44.119282 19317 sched.cpp:242] No credentials provided.
Attempting to register without authentication
I0119 02:41:44.123064 19317 sched.cpp:408] Framework registered with
20150119-003609-201523392-5050-7198-000
Please see this thread:
http://search-hadoop.com/m/LgpTk2aVYgr/Hadoop+guava+upgrade&subj=Re+Time+to+address+the+Guava+version+problem
> On Jan 19, 2015, at 6:03 AM, Romi Kuntsman wrote:
>
> I have recently encountered a similar problem with Guava version collision
> with Hadoop.
>
> Isn't it
Actually there is already someone on Hadoop-Common-Dev taking care of
removing the old Guava dependency
http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201501.mbox/browser
https://issues.apache.org/jira/browse/HADOOP-11470
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
O
I have recently encountered a similar problem with Guava version collision
with Hadoop.
Isn't it more correct to upgrade Hadoop to use the latest Guava? Why are
they staying on version 11, does anyone know?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Wed, Jan 7, 2015 at 7:59
Hi,
Could you please describe the symptom you observed a little more specifically?
For example, were there any exceptions from Kafka or ZooKeeper, and what other
exceptions did you see on the Spark side? With only this short description we
can hardly find any clue.
Thanks
Jerry
From: Eduardo Alfai
For our use case we need to integrate Drools, but there are no guides or
tutorials. How can we use Drools in Spark? If someone has an idea how to set
it up, please guide me.
Thanks in Advance.
Regards,
Vishnu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.co
From the OP:
(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Filter out all rows from "lines" that are not of type A or B
(3) val processA = Process only the A rows from ABonly
(4) val processB = Process only the B rows from ABonly
I assume that 3 and 4 are actions, or els
+1, I too need to know.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Where can I find a good documentation on sql catalyst?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/is-there-documentation-on-spark-sql-catalyst-tp21232.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
The problem we're seeing is a gradual memory leak in the driver's JVM. We're
executing jobs from a long-running Java app which creates relatively
short-lived SparkContexts, so our Spark drivers are created as part of
this application's JVM. We're using standalone cluster mode, Spark 1.0.2.
Root cause of t
Yeah... I made that mistake in spark/conf/spark-defaults.conf when
setting spark.eventLog.dir.
Now it works
Thank you
Karthik
On Mon, Jan 19, 2015 at 3:29 PM, Sean Owen wrote:
> Sorry, to be clear, you need to write "hdfs:///home/...". Note three
> slashes; there is an empty host betw
The problem is clearly to do with the executor exceeding YARN
allocations, so, this can't be in local mode. He said this was running
on YARN at the outset.
On Mon, Jan 19, 2015 at 2:27 AM, Raghavendra Pandey
wrote:
> If you are running spark in local mode, executor parameters are not used as
> th
I don't think this has anything to do with transferring anything from
the driver, or per task. I'm talking about a singleton object in the
JVM that loads whatever you want from wherever you want and holds it
in memory once per JVM. That is, I do not think you have to use
broadcast, or even any Spar
I am quoting the reply I got on this - which for some reason did not get
posted here. The suggestion in the reply below worked perfectly for me. The
error mentioned in the reply is not related (or old).
Hope this is helpful to someone.
Cheers,
BB
> Hi, BB
>Ideally you can do the query like:
Sorry, to be clear, you need to write "hdfs:///home/...". Note three
slashes; there is an empty host between the 2nd and 3rd. This is true
of most URI schemes with a host.
On Mon, Jan 19, 2015 at 9:56 AM, Rapelly Kartheek
wrote:
> Yes yes.. hadoop/etc/hadoop/hdfs-site.xml file has the path like:
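For illustration, a sketch of the fixed form (the path itself is invented):

sc.textFile("hdfs:///home/karthik/input.txt")  // empty authority: "home" is a directory, not a host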
+1 to Sean's suggestion
On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen wrote:
> I bet somewhere you have a path like "hdfs://home/..." which would
> suggest that 'home' is a hostname, when I imagine you mean it as a
> root directory.
>
> On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek
> wrote:
>> Hi
Yes yes.. hadoop/etc/hadoop/hdfs-site.xml file has the path like:
"hdfs://home/..."
On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen wrote:
> I bet somewhere you have a path like "hdfs://home/..." which would
> suggest that 'home' is a hostname, when I imagine you mean it as a
> root directory.
>
> On
I bet somewhere you have a path like "hdfs://home/..." which would
suggest that 'home' is a hostname, when I imagine you mean it as a
root directory.
On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek
wrote:
> Hi,
>
> I get the following exception when I run my application:
>
> karthik@karthik:~/s
Actually, I don't have any entry in my /etc/hosts file with the hostname
"home". In fact, I didn't use this hostname anywhere. Then why is it
trying to resolve it?
On Mon, Jan 19, 2015 at 3:15 PM, Ashish wrote:
> it's not able to resolve home to an IP.
> Assuming it's your local machine,
It's not able to resolve "home" to an IP.
Assuming it's your local machine, add an entry like the following to your
/etc/hosts file and then run the program again (use sudo to edit the file):
127.0.0.1 home
On Mon, Jan 19, 2015 at 3:03 PM, Rapelly Kartheek
wrote:
> Hi,
>
> I get the following exception when I ru
-- Forwarded message --
From: Rapelly Kartheek
Date: Mon, Jan 19, 2015 at 3:03 PM
Subject: UnknownhostException : home
To: "user@spark.apache.org"
Hi,
I get the following exception when I run my application:
karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class
org.apache.
Hi,
On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng wrote:
>
> Can you share more information about how do you do that? I also have
> similar question about this.
>
Not very proud of it ;-), but here you go:
// find the number of workers available to us.
val _runCmd = scala.util.Properties.prop
Hi,
I get the following exception when I run my application:
karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class
org.apache.spark.examples.SimpleApp001 --deploy-mode client --master
spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar
>out1.txt
log4j:WARN No such propert
Added a JIRA to track
https://issues.apache.org/jira/browse/SPARK-5309
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21229.html
Sent from the Apache Spark User List mailing list archive at Nabble.co
Is there any way to support multiple users executing SQL on one thrift
server?
I think there are some problems in Spark 1.2.0. For example:
1. Start the thrift server as user A.
2. Connect to the thrift server via beeline as user B.
3. Execute "insert into table dest select … from table src",
then w
Hi all,
I'm trying to implement a pipeline for computer vision based on the latest
ML package in Spark. The first step of my pipeline is to decode images (jpeg
for instance) stored in a parquet file.
For this, I begin by creating a UserDefinedType that represents a decoded
image stored in an array of
Yes it works!
But the filter can't be pushed down!!!
What if the custom ParquetInputFormat only implements the datasource API?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
2015-01-16 21:51 GMT+08:00 Xiaoyu Wang :
> Thanks yana!
> I will try i
Hello Mixtou, if you want to look at partition ID, I believe you want to use
mapPartitionsWithIndex
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-Question-on-How-Tasks-are-Executed-tp21064p21228.html
Sent from the Apache Spark User List mailing li
Hi Jon, I am looking for an answer to a similar question in the docs now; so
far no clue.
I would need to know what Spark's behaviour is in a situation like the example
you provided, but taking into account also that there are multiple
partitions/workers.
I could imagine it's possible that differen