Hi Ricky,
In your first try, you are using flatMap, which gives you a flat list of
strings. You are then trying to map each string to a Row, which is why it
throws an exception.
Following Terry's idea, you are mapping the input to a list of arrays, each
of which contains some strings. Then you
What format does your REST server expect?
You may have seen this:
https://www.paypal-engineering.com/2014/02/13/hello-newman-a-rest-client-for-scala/
On Fri, Aug 28, 2015 at 9:35 PM, Cassa L wrote:
> Hi,
> If I have an RDD that counts something, e.g.:
>
> JavaPairDStream successMsgCounts = suc
Hi,
If I have an RDD that counts something, e.g.:
JavaPairDStream successMsgCounts = successMsgs
    .flatMap(buffer -> Arrays.asList(buffer.getType()))
    .mapToPair(txnType -> new Tuple2("Success " + txnType, 1))
    .reduceByKey((count1, count2) -> count1 +
I have a workaround for the issue.
As you can see from the log, it is about 15 sec between worker start and
shutdown.
The workaround might be to sleep 30 sec, check if the worker is running and,
if not, try start-slave again.
Part of the EMR Spark bootstrap Python script:
spark_master = "spark://...:7077"
...
curl
I'm curious where the factor of 6-8 comes from? Is this assuming snappy
(or lzf) compression? The sizes I mentioned are what the Spark UI reports,
not sure if those are before or after compression (for the shuffle
read/write).
On Fri, Aug 28, 2015 at 4:41 PM, java8964 wrote:
> There are severa
There are several possibilities here.
1) Keep in mind that 7GB of data will need way more than 7G of heap, as
deserialized Java objects need much more space than the data itself. A rule of
thumb is 6 to 8 times, so 7G of data needs about 50G of heap space. 2) You
should monitor the Spark UI to check how many record
It really depends on the code. I would say that the easiest way is to
restart the problematic action, find the straggler task, and analyze what's
happening with it using jstack, or make a heap dump and analyze it locally. For
example, it might be the case that your tasks are connecting to some
external
Hi Jason,
Thanks for the response. I believe I can look into a Redis-based solution
for storing this state externally. However, would it be possible to refresh
this from the store with every batch, i.e. what code can be written inside
the pipeline to fetch this info from the external store? Also, s
Ahh yes, thanks for mentioning data skew, I've run into that before as
well. The best approach there is to get statistics on the distribution of your
join key. If there are a few keys with a drastically larger number of
values, then a reducer task will always be swamped no matter how many
reducer-side p
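A rough sketch of what getting those statistics could look like in Scala
(names here are illustrative; `pairs` stands for the keyed RDD feeding the join):

// Count records per join key and inspect the heaviest keys to spot skew.
val keyCounts = pairs                              // pairs: RDD[(String, String)]
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)

// A few very large counts among the top keys indicates a skewed join key.
keyCounts.top(20)(Ordering.by[(String, Long), Long](_._2)).foreach(println)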
There's no canonical way to do this as I understand it. For instance, when
running under YARN, you have no idea where your containers will be
started. Moreover, if one of the containers fails, it might be
restarted on another machine, so the machine number might change at runtime.
To c
Hi Nikunj,
Depending on what kind of stats you want to accumulate, you may want to
look into the Accumulator/Accumulable API, or if you need more control, you
can store these things in an external key-value store (HBase, Redis, etc.)
and do careful updates there. Though be careful and make sure y
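For the accumulator route, a minimal sketch with the Spark 1.x API (the
counter names and the RDD `records` are just placeholders):

// Driver-side counters updated from tasks; values are read back on the driver.
val successCount = sc.accumulator(0L, "successMessages")
val failureCount = sc.accumulator(0L, "failureMessages")

records.foreach { record =>
  if (record.contains("SUCCESS")) successCount += 1 else failureCount += 1
}

println(s"success=${successCount.value} failure=${failureCount.value}")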
In my opinion aggregate+flatMap would work faster as it would make fewer
passes through the data. It would work like this:
import random

def agg(x, y):
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

# Source data
rdd = sc.parallelize(xrange(10), 5)
rdd2 = rdd.map(lamb
If the data is already in an RDD, the easiest way to calculate min/max for
each column would be the aggregate() function. It takes 2 functions as
arguments: the first is used to aggregate RDD values into your "accumulator",
the second is used to merge two accumulators. This way both min and max for
all the
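A hedged sketch of that pattern, assuming `rows` is an RDD[Array[Double]]
with a fixed number of columns:

val numCols = rows.first().length
// accumulator: (per-column minimums, per-column maximums)
val zero = (Array.fill(numCols)(Double.MaxValue), Array.fill(numCols)(Double.MinValue))

val (mins, maxs) = rows.aggregate(zero)(
  // fold one row into the accumulator
  (acc, row) => {
    for (i <- 0 until numCols) {
      acc._1(i) = math.min(acc._1(i), row(i))
      acc._2(i) = math.max(acc._2(i), row(i))
    }
    acc
  },
  // merge two accumulators
  (a, b) => {
    for (i <- 0 until numCols) {
      a._1(i) = math.min(a._1(i), b._1(i))
      a._2(i) = math.max(a._2(i), b._2(i))
    }
    a
  }
)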
Any idea what is causing this error?
15/08/28 21:03:03 WARN scheduler.TaskSetManager: Lost task 34.0 in stage
9.0 (TID 20, dtord01hdw0228p.dc.dotomi.net): java.lang.RuntimeException:
cannot find field message_campaign_id from
[0:error_error_error_error_error_error_error, 1:cannot_determine_schema,
2:c
Yeah, I tried with 10k and 30k and these still failed, will try with more
then. Though that is a little disappointing, it only writes ~7TB of shuffle
data which shouldn't in theory require more than 1000 reducers on my 10TB
memory cluster (~7GB of spill per reducer).
I'm now wondering if my shuffle
Hi Everybody!
Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:
https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics
The gist of the results is as follows:
Most people found spark-ec2 useful as an easy way to get
Just to confirm, is this what you are referring to? Is there any
example on how to set it? I believe it is for version 0.8.3?
https://cwiki.apache.org/confluence/display/KAFKA/Multiple+Listeners+for+Kafka+Brokers
On Fri, Aug 28, 2015 at 12:52 PM, Sriharsha Chintalapani
wrote:
> You can con
Hi,
I have data written in HDFS using a custom storage handler of Hive. Can I
access that data in Spark using Spark SQL?
For example, can I write a Spark SQL query to access the data from a Hive table
in HDFS which was created as:
CREATE TABLE custom_table_1(key int, value string)
STORED BY 'org.apac
I've wanted similar functionality too: when network-IO bound (for me, I was
trying to pull things from S3 to HDFS) I wish there was a `.mapMachines`
API where I wouldn't have to guess at the proper partitioning of a
'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk
from S3
+1 on Jason's suggestion.
bq. this large variable is broadcast many times during the lifetime
Please consider making this large variable more granular. Meaning, reduce
the amount of data transferred between the key value store and your app
during update.
Cheers
On Fri, Aug 28, 2015 at 12:44 PM,
Yeah, the direct API uses the simple consumer.
On Fri, Aug 28, 2015 at 1:32 PM, Cassa L wrote:
> Hi I am using below Spark jars with Direct Stream API.
> spark-streaming-kafka_2.10
>
> When I look at its pom.xml, Kafka libraries that its pulling in is
>org.apache.kafka
>kafka_${scal
I had similar problems to this (reduce-side failures for large joins (25bn
rows with 9bn)), and found the answer was to increase
spark.sql.shuffle.partitions beyond 1000. In my case, 16k partitions worked for
me, but your tables look a little denser, so you may want to go even higher.
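For reference, one way to set this with the Spark 1.x API (16000 is only an
example value, tune it for your data):

// raise the number of reduce-side partitions used by Spark SQL shuffles
sqlContext.setConf("spark.sql.shuffle.partitions", "16000")
// or pass --conf spark.sql.shuffle.partitions=16000 to spark-submit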
On Thu, Aug 2
You could try using an external key-value store (like HBase, Redis) and
perform lookups/updates inside your mappers (you'd need to create the
connection within a mapPartitions code block to avoid the connection
setup/teardown overhead)?
I haven't done this myself though, so I'm just throwing th
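Something along these lines, as a sketch only: `KVClient` is a hypothetical
stand-in for whatever HBase/Redis client you use, and `records` is assumed to
be an RDD[(String, String)] keyed by the lookup id.

val enriched = records.mapPartitions { iter =>
  val client = new KVClient("store-host", 6379)      // one connection per partition
  val results = iter.map { case (id, value) =>
    (id, value, client.get(id))                       // one lookup per record
  }.toList                                            // materialize before closing
  client.close()
  results.iterator
}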
Can we use the existing Kafka Spark streaming jar to connect to a Kafka
server running in SSL mode?
We are fine with a non-SSL consumer as our Kafka cluster and Spark cluster
are in the same network.
Thanks,
Sourabh
On Fri, Aug 28, 2015 at 12:03 PM, Gwen Shapira wrote:
> I can't speak for the Sp
Hi all,
I have the following use case that I wanted to get some insight on how to
go about doing in Spark Streaming.
Every batch is processed through the pipeline and at the end, it has to
update some statistics information. This updated info should be reusable in
the next batch of this DStream e
Hi, I am using the below Spark jars with the Direct Stream API.
spark-streaming-kafka_2.10
When I look at its pom.xml, the Kafka library that it is pulling in is
org.apache.kafka
kafka_${scala.binary.version}
0.8.2.1
I believe this DirectStream API uses the SimpleConsumer API. Can someone from
Hi,
I was going through SSL setup of Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
However, I am also using Spark-Kafka streaming to read data from Kafka. Is
there a way to enable SSL for the Spark streaming API, or is it not possible
at all?
Thanks,
LCassa
Filter into a true RDD and a false RDD. Union the true RDD with a sample of the false RDD.
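As a rough sketch, assuming `rdd` is your RDD[(String, Boolean)] (the 10%
sampling fraction is just an example):

val positives = rdd.filter { case (_, flag) => flag }
val negatives = rdd.filter { case (_, flag) => !flag }

// keep all positives plus a random ~10% of the negatives
val result = positives.union(negatives.sample(withReplacement = false, fraction = 0.1))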
On Aug 28, 2015 2:57 AM, "Gavin Yue" wrote:
> Hey,
>
>
> I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and
> randomly keep some Boolean:false rows. And hope in the final result, the
> negative one
Thanks, this looks better
// parse the lines of data into sensor objects
val sensorDStream = ssc.textFileStream("/stream").map(Sensor.parseSensor)
sensorDStream.foreachRDD { rdd =>
  // filter sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi <
Or you can just call describe() on the dataframe? In addition to min-max,
you'll also get the mean, and count of non-null and non-NA elements as well.
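For example, assuming `df` is the DataFrame:

df.describe().show()                 // count, mean, stddev, min, max for numeric columns
df.describe("col1", "col2").show()   // or only for selected columns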
Burak
On Fri, Aug 28, 2015 at 10:09 AM, java8964 wrote:
> Or RDD.max() and RDD.min() won't work for you?
>
> Yong
>
> --
Load the textFile as an RDD. Something like this:

val file = sc.textFile("/path/to/file")

After this you can manipulate this RDD to filter texts the way you want
them:

val a1 = file.filter( line => line.contains("[ERROR]") )
val a2 = file.filter( line => line.contains("[WARN]") )
I think there is already an example for this shipped with Spark. However,
you do not really benefit from any Spark functionality for this scenario.
If you want to do something more advanced you should look at Elasticsearch
or Solr.
On Fri, Aug 28, 2015 at 16:15, Darksu wrote:
> Hello,
>
> T
Or RDD.max() and RDD.min() won't work for you?
Yong
Subject: Re: Calculating Min and Max Values using Spark Transformations?
To: as...@wso2.com
CC: user@spark.apache.org
From: jfc...@us.ibm.com
Date: Fri, 28 Aug 2015 09:28:43 -0700
If you already loaded csv data into a dataframe, why not registe
For the exception w.r.t. ManifestFactory, there is SPARK-6497, which is
open.
FYI
On Fri, Aug 28, 2015 at 8:25 AM, donhoff_h <165612...@qq.com> wrote:
> Hi, all
>
> I wrote a spark program which uses the Kryo serialization. When I count a
> rdd which type is RDD[(String,String)], it reported an
If you already loaded CSV data into a dataframe, why not register it as a
table and use Spark SQL
to find max/min or any other aggregates? SELECT MAX(column_name) FROM
dftable_name ... seems natural.
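A minimal sketch with the Spark 1.x API (the table and column names are just
examples):

df.registerTempTable("dftable_name")
val minMax = sqlContext.sql(
  "SELECT MIN(column_name) AS min_val, MAX(column_name) AS max_val FROM dftable_name")
minMax.show()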
Hey Rishitesh,
That's perfect, thanks so much! Don't know why I didn't think of using
mapPartitions like this.
Thanks,
Jem
On Fri, Aug 28, 2015 at 10:35 AM Rishitesh Mishra
wrote:
> Hi Jem,
> A simple way to get this is to use MapPartitionedRDD. Please see the below
> code. For this you need to kn
Hi, all
I wrote a Spark program which uses Kryo serialization. When I count an RDD
whose type is RDD[(String,String)], it reports an Exception like the
following:
* Class is not registered: org.apache.spark.util.collection.CompactBuffer[]
* Note: To register this class use:
kryo.register(
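One workaround that is sometimes suggested for this particular error (only a
sketch, and not necessarily the fix for the ManifestFactory issue tracked by
SPARK-6497) is to register the Spark-internal class and its array form via
Class.forName, since CompactBuffer cannot be referenced directly:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.util.collection.CompactBuffer"),
    Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")
  ))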
Dear All,
I am trying to cluster 350k English text phrases (each with 4-20 words) into
50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using
the Kryo serializer with MEMORY_AND_DISK_SER set. Although I get clustering
results with a lower number of features in HashingTF, the cluste
Yes, for example "val sensorRDD = rdd.map(Sensor.parseSensor)" is a
line of code executed on the driver; it's part of the function you
supplied to foreachRDD. However that line defines an operation on an
RDD, and the map function you supplied (parseSensor) will ultimately
be carried out on the cluster
I think what you’ll want is to carry out the .map functions before the
foreachRDD, something like:
val lines =
  ssc.textFileStream("/stream").map(Sensor.parseSensor).map(Sensor.convertToPut)
lines.foreachRDD { rdd =>
  // the records are already parsed and converted to Put objects; just save the RDD
  rdd.saveAsHadoo
I would like to make sure that I am using the DStream foreachRDD
operation correctly. I would like to read from a DStream, transform the
input, and write to HBase. The code below works, but I became confused
when I read "Note that the function *func* is executed in the driver
process".
val
Hello,
This is my first post, so I would like to congratulate the Spark team on the
great work!
In short, I have been studying Spark for the past week in order to create a
feasibility project.
The main goal of the project is to process text documents (word count will
not be over 200 words) in order
Yes, we will open up APIs in the next release. There was some discussion about
the APIs. One approach is to have multiple methods for different outputs,
like predicted class and probabilities. -Xiangrui
On Aug 28, 2015 6:39 AM, "Jaonary Rabarisoa" wrote:
> Hi there,
>
> The actual API of ml.Transform
Hi there,
The current API of ml.Transformer uses only DataFrame as input. I have a use
case where I need to transform a single element, for example transforming
an element from Spark Streaming. Is there any reason for this, or will
ml.Transformer support transforming a single element later?
Che
Sounds good. It's a request I have seen a few times in the past and have
needed personally. Maybe Joseph Bradley has something to add.
I think a JIRA to capture this will be great. We can move this discussion
to the JIRA then.
On Friday, August 28, 2015, Cody Koeninger wrote:
> I wrote some
This is great and much appreciated. Thank you.
- Jim
From: Manish Amde [mailto:manish...@gmail.com]
Sent: Friday, August 28, 2015 9:20 AM
To: Cody Koeninger
Cc: Murphy, James; user@spark.apache.org; d...@spark.apache.org
Subject: Re: Feedback: Feature request
Sounds good. It's a request I have se
I wrote some code for this a while back, pretty sure it didn't need access
to anything private in the decision tree / random forest model. If people
want it added to the api I can put together a PR.
I think it's important to have separately parseable operators / operands
though. E.g
"lhs":0,"op
Yes, absolutely. Take a look at:
https://spark.apache.org/docs/1.4.1/mllib-statistics.html#summary-statistics
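A short sketch based on that page, assuming `data` is an RDD of rows already
converted to Array[Double]:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = data.map(row => Vectors.dense(row))
val summary = Statistics.colStats(observations)

println(summary.min)    // column-wise minimum
println(summary.max)    // column-wise maximum
println(summary.mean)   // column-wise mean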
On Fri, Aug 28, 2015 at 8:39 AM, ashensw wrote:
> Hi all,
>
> I have a dataset which consist of large number of features(columns). It is
> in csv format. So I loaded it into a spark data
Hi all,
I have a dataset which consists of a large number of features (columns). It is
in CSV format, so I loaded it into a Spark dataframe. Then I converted it
into a JavaRDD. Then, using a Spark transformation, I converted that into a
JavaRDD. Then I converted it into a JavaRDD again. So now
I have a JavaR
Hi,
I am trying to change the following code so as to get the probabilities of
the input Vector on each class (instead of the class itself with the
highest probability). I know that this is already available as part of the
most recent release of Spark but I have to use Spark 1.1.0.
Any help is ap
Thanks..
On Aug 28, 2015 4:55 AM, "Rishitesh Mishra"
wrote:
> Hi Swapnil,
>
> 1. All the task scheduling/retry happens from Driver. So you are right
> that a lot of communication happens from driver to cluster. It all depends
> on the how you want to go about your Spark application, whether your
Terry:
Unfortunately, the error remains when I use your advice. But the error has
changed; now the error is java.lang.ArrayIndexOutOfBoundsException: 71.
The error log is as follows:
15/08/28 19:13:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
all completed, from pool
org.apache.
This should work:
coon.filter(x => x.exists(el => Seq(1,15).contains(el)))
CompactBuffer is a specialised form of a Scala Iterator
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publication
Hi,
I am working on a Spark application that is using a large (~3G)
broadcast variable as a lookup table. The application refines the data in
this lookup table in an iterative manner. So this large variable is
broadcast many times during the lifetime of the application process.
From what I ha
Hi Team,
I upgraded Spark from an older version to 1.4.1. After the Maven build, I tried
to run my simple application, but it failed, giving the below stacktrace.
Exception in thread "main" java.lang.NoSuchMethodError:
com.fasterxml.jackson.databind.introspect.POJOPropertyBuilder.addField(Lcom/fasterxml/jack
Hi Jem,
A simple way to get this is to use MapPartitionedRDD. Please see the below
code. For this you need to know the partition numbers of your parent RDD that
you want to exclude. One drawback here is that the new RDD will also invoke a
similar number of tasks as the parent RDD, as both RDDs have the same numbe
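For illustration, a similar effect can be sketched with the public
mapPartitionsWithIndex API (the partition indices and the `parent` RDD name
are examples); as noted, it still launches a task for every parent partition:

val wantedPartitions = Set(0, 3, 7)

val selected = parent.mapPartitionsWithIndex { (idx, iter) =>
  if (wantedPartitions.contains(idx)) iter else Iterator.empty
}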
Ricky,
You may need to use map instead of flatMap in your case
*val rowRDD=sc.textFile("/user/spark/short_model").map(_.split("\\t")).map(p
=> Row(...))*
Thanks!
-Terry
On Fri, Aug 28, 2015 at 5:08 PM, our...@cnsuning.com
wrote:
> hi all,
>
> when using spark sql ,A problem bothering me.
I have a sparse dataset of size 775946 x 845372. I would like to perform a
grid search in order to tune the parameters of my LogisticRegressionWithSGD
model. I have noticed that the building of each model takes about 300 to
400 seconds. That means that in order to try all possible combinations of
p
Hi all,
when using Spark SQL, a problem is bothering me.
The code is as follows:
val schemaString =
"visitor_id,cust_num,gds_id,l1_gds_group_cd,l4_gds_group_cd,pc_gds_addcart,pc_gds_collect_num,pc_gds_four_page_pv,pc_gds_four_page_time,pc_gds_four_page_fromsearch_pv,pc_gds_four_page_fr
Can you post roughly what you’re running as your Spark code? One issue I’ve
seen before is that passing a directory full of files as a path
“/path/to/files/” can be slow, while “/path/to/files/*” runs fast.
Also, if you’ve not seen it, have a look at the binaryFiles call
http://spark.apache.org
Hi Swapnil,
1. All the task scheduling/retry happens from Driver. So you are right that
a lot of communication happens from driver to cluster. It all depends on
how you want to go about your Spark application, whether your
application has direct access to the Spark cluster or it is routed through a
Hi Gavin,
You can increase the speed by choosing a better encoding. A little bit
of ETL goes a long way.
e.g. As you're working with Spark SQL you probably have a tabular
format. So you could use CSV so you don't need to parse the field names
on each entry (and it will also reduce the file s
500 each with 8GB memory.
I did the test again on the cluster.
I have 6000 files which generate 6000 tasks. Each task takes 1.5 min to
finish based on the Stats.
So theoretically it should take 15 mins roughly. With some additional
overhead, it takes 18 mins in total.
Based on the local file pa
Hi,
I am trying to create an RDD from a selected number of its parent's
partitions. My current approach is to create my own SelectedPartitionRDD
and implement compute and numPartitions myself; the problem is that the compute
method is marked as @DeveloperApi, and hence unsuitable for me to be using
in my ap