Have a look at
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
Thanks
Best Regards
On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:
>
> Hi All,
>
> I have written a Spark program on my dev box,
> IDE: IntelliJ
>
Isn't that what tempRDD.groupByKey does?
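In case it helps, a minimal sketch of what I mean (assuming an existing SparkContext sc, e.g. in spark-shell, and the (brand, (product, key)) shape from the quoted message below; the values are made up):
// sample data in the (brand, (product, key)) shape
val tempRDD = sc.parallelize(Seq(
  ("brandA", ("product1", "key1")),
  ("brandA", ("product2", "key2")),
  ("brandB", ("product3", "key3"))))
// groupByKey gives RDD[(String, Iterable[(String, String)])]: all (product, key) pairs per brand
val grouped = tempRDD.groupByKey()
grouped.collect().foreach(println)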
Thanks
Best Regards
On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh
wrote:
> Hi All,
>
> I have an RDD having the data in the following form :
>
> tempRDD: RDD[(String, (String, String))]
>
> (brand , (product, key))
>
>
Looks like winutils.exe is missing from the environment; see
https://issues.apache.org/jira/browse/SPARK-2356
Thanks
Best Regards
On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman wrote:
> Hi,
>
> I am using the Spark 1.6.0 prebuilt for Hadoop 2.6.0 version on my Windows machine.
>
You can achieve this the normal RDD way: add one extra stage to the pipeline
that properly standardizes all the values (like replacing doc with doctor)
across the columns before the join.
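A rough sketch of that extra stage (assuming an existing SparkContext sc; the mapping and sample data are made up):
// map raw values to a canonical form before joining
val canonical = Map("doc" -> "doctor", "dr" -> "doctor")
val left  = sc.parallelize(Seq(("doc", 1), ("dr", 2), ("nurse", 3)))
val right = sc.parallelize(Seq(("doctor", "cardiology"), ("nurse", "icu")))
// extra stage: standardize the join key on one side (do the same on the other side if needed)
val standardized = left.map { case (k, v) => (canonical.getOrElse(k, k), v) }
standardized.join(right).collect().foreach(println)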
Thanks
Best Regards
On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh
This?
http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html
Thanks
Best Regards
On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen wrote:
> I know Patrick told us at some point, but I can't find the email or
> wiki that describes how to run
Does it support OVER? I couldn't find it in the documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
Thanks
Best Regards
On Fri, Jan 22, 2016 at 2:31 PM, 汪洋 wrote:
> I think it cannot be right.
>
> On Jan 22, 2016, at 4:53 PM, 汪洋
Have a look at the TPC-H queries; I found this repository with the queries:
https://github.com/ssavvides/tpch-spark
Thanks
Best Regards
On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa
wrote:
> Hi,
> I have downloaded the Amplab benchmark dataset from
>
If the port 7077 is open for public on your cluster, that's all you need to
take over the cluster. You can read a bit about it here
https://www.sigmoid.com/securing-apache-spark-cluster/
You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/
Thanks
Best
You can pretty much measure it from the event timeline listed in the driver
UI: click on the jobs/tasks and you can get the time that each of them took
from there.
Thanks
Best Regards
On Thu, Dec 17, 2015 at 7:27 AM, sara mustafa
wrote:
> Hi,
>
> The class
Not quite sure what's happening, but I don't think it's an issue with the
multiplication, as the following query worked for me:
trades.select(trades("price")*9.5).show
+-------------+
|(price * 9.5)|
+-------------+
|        199.5|
|        228.0|
|        190.0|
|        199.5|
|        190.0|
|
Is that all you have in the executor logs? I suspect some of those jobs are
having a hard time managing the memory.
Thanks
Best Regards
On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote:
> [adding dev list since it's probably a bug, but i'm not sure how to
> reproduce so I
You can read the installation details from here
http://spark.apache.org/docs/latest/
You can read about contributing to spark from here
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Thanks
Best Regards
On Thu, Oct 29, 2015 at 3:53 PM, Aaska Shah
You can't create a new RDD by selecting a few elements. rdd.take(n),
takeSample, etc. are actions and will trigger your entire pipeline to be
executed.
You could, however, do something like this:
val sample_data = rdd.take(10)
val sample_rdd = sc.parallelize(sample_data)
Thanks
Best
*Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Is that all you have in the executor logs? I suspect some of those jobs
>> are having a hard time managing the memory.
>
Can you paste the contents of your spark-env.sh file? It would also be good to
have a look at the /etc/hosts file. "Cannot bind to the given ip address" can
be resolved if you put the hostname instead of the ip address. Also make
sure the configuration (conf directory) across your cluster has the same
I guess the order is guaranteed unless you set
spark.streaming.concurrentJobs to a number higher than 1.
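For reference, that setting would be applied like this (a sketch; the app name is arbitrary):
// with spark.streaming.concurrentJobs > 1, batches can run in parallel,
// which is what would break the ordering guarantee
val conf = new org.apache.spark.SparkConf()
  .setAppName("ordering-example")
  .set("spark.streaming.concurrentJobs", "2")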
Thanks
Best Regards
On Mon, Oct 19, 2015 at 12:28 PM, Renjie Liu
wrote:
> Hi, all:
> I've read source code and it seems that there is no guarantee that the
>
For some reason the executors are getting killed,
15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
app-20150929120924-/24463 is now EXITED (Command exited with code 1)
Can you paste your spark-submit command? You can also look in the executor
logs and see what's going on.
You can create a JavaRDD as normal and then call .rdd() on it to get the underlying RDD.
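A minimal sketch of what I mean (assuming an existing SparkContext sc, e.g. from spark-shell):
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD
// wrap a plain RDD in a JavaRDD, then get the underlying Scala RDD back with .rdd()
val javaRdd: JavaRDD[String] = JavaRDD.fromRDD(sc.parallelize(Seq("a", "b", "c")))
val plainRdd: RDD[String] = javaRdd.rdd()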
Thanks
Best Regards
On Mon, Sep 28, 2015 at 9:01 PM, Rohith P
wrote:
> Hi all,
> I am trying to work with spark-redis connector (redislabs) which
> requires all transactions between
Send an email to dev-unsubscr...@spark.apache.org instead of
dev@spark.apache.org
Thanks
Best Regards
On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar
wrote:
>
You should consider upgrading your spark from 1.3.0 to a higher version.
Thanks
Best Regards
On Mon, Sep 14, 2015 at 2:28 PM, Priya Ch
wrote:
> Hi All,
>
> I came across the related old conversation on the above issue (
>
I found an old JIRA referring to the same:
https://issues.apache.org/jira/browse/SPARK-5421
Thanks
Best Regards
On Sun, Sep 6, 2015 at 8:53 PM, Madhu wrote:
> I'm not sure if this has been discussed already, if so, please point me to
> the thread and/or related JIRA.
>
> I have
Or you can increase the driver heap space (export _JAVA_OPTIONS="-Xmx5g")
Thanks
Best Regards
On Wed, Sep 2, 2015 at 11:57 PM, Mike Hynes <91m...@gmail.com> wrote:
> Just a thought; this has worked for me before on standalone client
> with a similar OOM error in a driver thread. Try setting:
>
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?
Thanks
Best Regards
On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti
wrote:
> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe
You can add it to Spark Packages, I guess: http://spark-packages.org/
Thanks
Best Regards
On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote:
Sorry for the previous line-breaking format; trying to resend the mail again.
I have written an sbt plugin called spark-deployer, which
PM, Imran Rashid iras...@cloudera.com wrote:
Oh I see, you are defining your own RDD Partition types, and you had a
bug where partition.index did not line up with the partition's slot in
rdd.getPartitions. Is that correct?
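Roughly this situation, in code (a sketch; the class names are made up):
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class MyPartition(val index: Int) extends Partition

class MyRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {
  // the Partition stored at slot i must report index == i,
  // otherwise downstream stages can pick up the wrong partition
  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => new MyPartition(i))
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}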
On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com
is to show overlapping partitions, duplicates. index to partition
mismatch - that sort of thing.
On Thu, Aug 13, 2015 at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yep, and it works fine for operations which do not involve any shuffle
(like foreach, count, etc.) and those which
Have a look at spark.shuffle.manager; you can switch between sort and hash
with this configuration. From the configuration docs:
spark.shuffle.manager (default: sort) - Implementation to use for shuffling data. There
are two implementations available: sort and hash. Sort-based shuffle is more
memory-efficient and is the default option
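Setting it programmatically would look something like this (a sketch; the app name is arbitrary):
// switch the shuffle implementation to hash (sort is the default)
val conf = new org.apache.spark.SparkConf()
  .setAppName("shuffle-example")
  .set("spark.shuffle.manager", "hash")
val sc = new org.apache.spark.SparkContext(conf)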
Hi Starch,
It also depends on the application's behavior; some applications might not be
able to utilize the network properly. If you are using, say, Kafka, then one
thing that you should keep in mind is the size of the individual messages and the
number of partitions that you are using. The higher the message
You can create a new issue and send a pull request for the same, I think.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:
Dear Sir / Madam,
I have a plan to contribute some codes about passing filters to a
datasource as physical
Hi
My Spark job (running in local[*] with Spark 1.4.1) reads data from a
Thrift server (I created an RDD; it computes the partitions in the
getPartitions() call, and in compute(), hasNext returns records from these
partitions). count() and foreach() are working fine and return the correct
number of
I think you can start from here
https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
Thanks
Best Regards
On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
I think the team is
You need to find the bottleneck here; it could be your network (if the data is
huge) or your producer code not actually pushing at 20k/s. If you are able to
produce at 20k/s, then make sure you are able to receive at that rate (try
it without Spark).
Thanks
Best Regards
On Sat, Jul 25, 2015 at 3:29 PM,
likely
would it be that a change like that goes thru? Would it be rejected as an
uncommon scenario? I really don't want to have this as a separate form of
the branch.
Thanks,
Joel
--
*From:* Akhil Das ak...@sigmoidanalytics.com
*Sent:* Wednesday, July 15, 2015 2:07
This will get you started
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Thanks
Best Regards
On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71
sreenivas.raghav...@gmail.com wrote:
Hello everyone,
I am interested to contribute to apache spark. I
You can try to resolve some JIRA issues; to start with, try out some newbie
JIRAs.
Thanks
Best Regards
On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71
sreenivas.raghav...@gmail.com wrote:
I saw the contribution sections. As a new contributor, should I try to build
patches or can I add
Can you look in the datanode logs and see what's going on? Most likely, you
are hitting the ulimit on open file handles.
Thanks
Best Regards
On Wed, Jul 8, 2015 at 10:55 AM, Pankaj Arora pankaj.ar...@guavus.com
wrote:
Hi,
I am running long running application over yarn using spark and I am
updateStateByKey?
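Something along these lines (a sketch; it assumes an existing StreamingContext with checkpointing enabled and a DStream of (id, value) string pairs named pairs, both of which are made-up names here):
// keep accumulating the values seen for each id (e.g. 12345) across batches
val state = pairs.updateStateByKey[Seq[String]] {
  (newValues: Seq[String], current: Option[Seq[String]]) =>
    Some(current.getOrElse(Seq.empty[String]) ++ newValues)
}
state.print()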
Thanks
Best Regards
On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote:
Hi,
Suppose I want the data to be grouped by an Id named 12345 and I have
certain amount of data coming out from one batch for 12345 and I have
data
related to 12345 coming after
Which distributed database are you referring to here? Spark can connect with
almost all the databases out there (you just need to pass the
Input/Output Format classes, or there are also a bunch of connectors
available).
Thanks
Best Regards
On Fri, Jun 26, 2015 at 12:07 PM, louis.hust
In the conf/slaves file, do you have the ip addresses or the hostnames?
Thanks
Best Regards
On Sat, Jun 13, 2015 at 9:51 PM, Sea 261810...@qq.com wrote:
In spark 1.4.0, I find that the Address is ip (it was hostname in v1.3.0),
why? who did it?
This is a good start, if you haven't seen this already
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Thanks
Best Regards
On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71
sreenivas.raghav...@gmail.com wrote:
Hi everyone,
I am interest to
If you look at the maven repo, you can see it's from Typesafe only:
http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark
For sbt, you can download the sources by adding withSources(), like:
libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" %
"2.3.4-spark" withSources()
Are you seeing the same behavior on the driver UI (the one running on port
4040)? If you click on the stage ID header, you can sort the stages based on
their IDs.
Thanks
Best Regards
On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote:
Hi folks,
When I look at the output logs for an
Yes Peter that's correct, you need to identify the processes and with that
you can pull the actual usage metrics.
Thanks
Best Regards
On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
Thanks Akhil, Ryan!
@Akhil: YARN can only tell me how much vcores my
You can either pull the high level information from your resource manager,
or if you want more control/specific information you can write a script and
pull the resource usage information from the OS. Something like this
Did you happen to have a look at this: https://github.com/abashev/vfs-s3
Thanks
Best Regards
On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com
wrote:
We have a small mesos cluster and these slaves need to have a vfs setup on
them so that the slaves can pull down the data
Maybe you should check where exactly it's throwing the permission denied error
(possibly it is trying to write to some directory). Also, you can try manually
cloning the git repo to a directory and then opening that in Eclipse.
Thanks
Best Regards
On Tue, May 12, 2015 at 3:46 PM, Chandrashekhar Kotekar
Looks like the jar you provided has some missing classes. Try this:
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
  "log4j" % "log4j" %
We had a similar issue while working on one of our use cases, where we were
processing at a moderate throughput (around 500MB/s). When the processing
time exceeded the batch duration, it started to throw block-not-found
exceptions; I made a workaround for that issue, and it is explained over here
Hi
With Spark streaming (all versions), when my processing delay (around 2-4
seconds) exceeds the batch duration (being 1 second) and on a decent
scale/throughput (consuming around 100MB/s on 1+2 node standalone 15GB, 4
cores each) the job will start to throw block not found exceptions when the
There's a similar issue reported over here
https://issues.apache.org/jira/browse/SPARK-6847
Thanks
Best Regards
On Tue, Apr 28, 2015 at 7:35 AM, wyphao.2007 wyphao.2...@163.com wrote:
Hi everyone, I am using val messages =
KafkaUtils.createDirectStream[String, String, StringDecoder,
I also want to add mine :/
Everyone wants to add it seems.
Thanks
Best Regards
On Fri, Apr 24, 2015 at 8:58 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I understand that. The following page
http://spark.apache.org/documentation.html has a external tutorials,blogs
section which points
There were some PRs about graphical representation with D3.js; you can
probably find them on GitHub. Here are a few of them:
https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3
Thanks
Best Regards
On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com
wrote:
Dear
I think you can override the SPARK_CLASSPATH with your newly built jar.
Thanks
Best Regards
On Mon, Apr 20, 2015 at 2:28 PM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
I'm building a different version of Spark Streaming (based on a different
branch than master) in my application for
Did you try ssh tunneling instead of SOCKS?
Thanks
Best Regards
On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com
wrote:
I'm trying to figure out how I might be able to use Spark with a SOCKS
proxy. That is, my dream is to be able to write code in my IDE then run it
Can you paste the complete code?
Thanks
Best Regards
On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
Hi,
I've implemented class MyClass in MLlib that does some operation on
LabeledPoint. MyClass extends serializable, so I can map this operation on
data of
You can open a JIRA issue pointing to this PR to get it processed faster. :)
Thanks
Best Regards
On Sat, Feb 7, 2015 at 7:07 AM, fommil sam.halli...@gmail.com wrote:
Hi all,
I'm the author of netlib-java and I noticed that the documentation in MLlib
was out of date and misleading, so I
Here's the sbt version
https://docs.sigmoidanalytics.com/index.php/Step_by_Step_instructions_on_how_to_build_Spark_App_with_IntelliJ_IDEA
Thanks
Best Regards
On Thu, Feb 5, 2015 at 8:55 AM, Stephen Boesch java...@gmail.com wrote:
For building in intellij with sbt my mileage has varied widely:
It's the executor memory (spark.executor.memory), which you can set while
creating the Spark context. By default it uses 0.6 (60%) of the executor memory
for storage. Now, to show some memory usage, you need to cache (persist)
the RDD. Regarding the OOM exception, you can increase the level of
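For example, something like this (a sketch; the memory value and app name are arbitrary):
// set the executor memory when creating the context, then cache an RDD so it
// shows up under the Storage tab once an action has materialized it
val conf = new org.apache.spark.SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "4g")
val sc = new org.apache.spark.SparkContext(conf)
val cached = sc.parallelize(1 to 1000000).cache()
cached.count()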
, Akhil Das ak...@sigmoidanalytics.com
wrote:
My mails to the mailing list are getting rejected; I have opened a JIRA
issue. Can someone take a look at it?
https://issues.apache.org/jira/browse/INFRA-9032
Thanks
Best Regards
We usually run Spark in HA with the following stack:
- Apache Mesos
- Marathon - init/control system for starting, stopping, and maintaining
always-on applications (mainly Spark Streaming)
- Chronos - general-purpose scheduler for Mesos, supports job dependency
graphs.
- Spark Job Server -
I think you need to start your streaming job first and then put the files
there to get them read; textFileStream doesn't read pre-existing files, I believe.
Also, are you sure the path is not the following? (no missing / at the
beginning?)
JavaDStream<String> textStream = ssc.textFileStream("/user/
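In Scala, for reference, the behavior looks like this (a sketch; the directory and batch interval are arbitrary, and an existing SparkContext sc is assumed):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
// only files created in the directory after the stream starts are picked up
val lines = ssc.textFileStream("/user/some/dir")
lines.print()
ssc.start()
ssc.awaitTermination()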
It shows a NullPointerException; could your data be corrupted? Try putting a
try/catch inside the operation that you are doing. Are you running the
worker process on the master node also? If not, then only 1 node will be
doing the processing. If yes, then try setting the level of parallelism and
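Raising the level of parallelism would look roughly like this (a sketch; the numbers are arbitrary and rdd is assumed to be an existing RDD):
// either raise the default parallelism on the configuration...
val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "32")
// ...or repartition a specific RDD before the heavy operation
val repartitioned = rdd.repartition(32)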
Hi Prabeesh,
Do an export _JAVA_OPTIONS=-Xmx10g before starting Shark. You can also
do a ps aux | grep shark and see how much memory it is being allocated;
mostly it should be 512 MB, and in that case increase the limit.
Thanks
Best Regards
On Fri, May 23, 2014 at 10:22 AM, prabeesh k