You get the list of all the persisted RDDs using the Spark context...
On Aug 21, 2015 12:06 AM, Rishitesh Mishra rishi80.mis...@gmail.com
wrote:
I am not sure if you can view all RDDs in a session. Tables are maintained
in a catalog, hence it's easier. However, you can see the DAG
representation.
It comes at the start of each task when there is new data inserted in Kafka
(the data inserted is very small).
The Kafka topic has 300 partitions; the data inserted is ~10 MB.
Tasks get failed and retried; the retries succeed, but after a certain number
of failed tasks it kills the job.
On Sat, Aug 22, 2015 at 2:08 AM,
hi Ted,
thanks for your reply. Is there any other way to do this with Spark 1.3,
such as writing the ORC file manually in a foreachPartition method?
On Sat, Aug 22, 2015 at 12:19 PM, Ted Yu yuzhih...@gmail.com wrote:
ORC support was added in Spark 1.4
See SPARK-2883
On Fri, Aug 21, 2015 at 7:36
Spark ML plans to provide Machine Learning pipelines so that users can do
machine learning more efficiently.
It's more general to chain StringIndexer with any kind of Estimator.
I think we can make the StringIndexer reverse process automatic in the
future.
If you want to know your original labels, you
Hi,
I'm using Spark 1.3.1 built against Hadoop 1.0.4 and Java 1.7, and I'm
trying to save my data frame to Parquet.
The issue I'm stuck on looks like serialization is trying to do a pretty weird
thing: it tries to write to an empty array.
The last (going by the stack trace) line of Spark code that leads to
Hi *,
We are trying to run Spark on top of Mesos using fine-grained mode. While
talking to a few people, I came to know that running a Spark job using
fine-grained mode on Mesos is not a good idea.
I could not find anything regarding fine-grained mode getting deprecated,
and also if coarse-grained mode
Hi,
My scenario goes like this:
I have an algorithm running in Spark streaming mode on a 4-core virtual
machine. The majority of the time, the algorithm does disk I/O and database
I/O. The question is: during the I/O, when the CPU is not considerably loaded,
is it possible to run any other task/thread
This is something of a wild guess, but I find that when executors start
disappearing for no obvious reason, this is usually because the YARN
node managers have decided that the containers are using too much memory and
have terminated the executors.
Unfortunately, to see evidence of this, one
I believe spark-shell -i scriptFile is there. We also use it, at least in
Spark 1.3.1.
dse spark just wraps the spark-shell command; underneath, it is just invoking
spark-shell.
I don't know too much about the original problem though.
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re:
2015-08-21 3:17 GMT-07:00 smagadi sudhindramag...@fico.com:
teenagers.toJSON gives the JSON but it does not preserve the parent IDs,
meaning if the input was {"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"},"age":20}
val x = sqlContext.sql("SELECT name, address.city, address.state, age FROM
HI All,
Any inputs for the actual problem statement
Regards,
Satish
On Fri, Aug 21, 2015 at 5:57 PM, Jeff Zhang zjf...@gmail.com wrote:
Yong, Thanks for your reply.
I tried spark-shell -i script-file; it works fine for me. Not sure about the
difference with
dse spark --master local --jars
I've been able to almost halve my memory usage with no instability issues.
I lowered my storage.memoryFraction and increased my shuffle.memoryFraction
(essentially swapping them). I set spark.yarn.executor.memoryOverhead to
6GB. And I lowered executor-cores in case other jobs are using the
You had:
RDD.reduceByKey((x,y) => x+y)
RDD.take(3)
Maybe try:
rdd2 = RDD.reduceByKey((x,y) => x+y)
rdd2.take(3)
-Abhishek-
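As a sanity check on what the suggestion computes, here is a minimal pure-Python sketch (not Spark; the function and variable names are illustrative) of reduceByKey with (x, y) => x + y over sample pairs like those mentioned below:

```python
def reduce_by_key(pairs, fn):
    # Fold the values of each key with fn, mimicking RDD.reduceByKey.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

pairs = [(0, 1), (0, 2), (1, 20), (1, 30), (2, 40)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # {0: 3, 1: 50, 2: 40}
```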
On Aug 20, 2015, at 3:05 AM, satish chandra j jsatishchan...@gmail.com wrote:
HI All,
I have data in RDD as mentioned below:
RDD : Array[(Int),(Int)] =
Sounds like that's happening consistently, not an occasional network
problem?
Look at the Kafka broker logs
Make sure you've configured the correct kafka broker hosts / ports (note
that direct stream does not use zookeeper host / port).
Make sure that host / port is reachable from your driver
Hi all,
I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's
native libraries, so I am not able to use it.
Can anyone suggest how to proceed? Hopefully I won't have to recompile
Hadoop. I tried changing the --driver-library-path to point directly at the lz4
stand
Have you read this ?
http://stackoverflow.com/questions/22716346/how-to-use-lz4-compression-in-linux-3-11
On Aug 21, 2015, at 6:57 AM, saif.a.ell...@wellsfargo.com
saif.a.ell...@wellsfargo.com wrote:
Hi all,
I am using pre-compiled spark with hadoop 2.6. LZ4 Codec is not on hadoop’s
Hello,
I attended the Tungsten-related presentations at Spark Summit (by Josh
Rosen) and at Big Data Scala (by Matei Zaharia). Needless to say, this
project holds great promise for major performance improvements.
At Josh's talk, I heard about the use of sun.misc.Unsafe as a way of
achieving some
HI Abhishek,
I have even tried that but rdd2 is empty
Regards,
Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh
abhis...@tetrationanalytics.com wrote:
You had:
RDD.reduceByKey((x,y) => x+y)
RDD.take(3)
Maybe try:
rdd2 = RDD.reduceByKey((x,y) => x+y)
rdd2.take(3)
Which version of Spark are you using, or which comes with DSE 4.7?
We just cannot reproduce it in Spark.
yzhang@localhost$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect
yzhang@localhost$ ~/spark/bin/spark-shell --master local -i
Hi All,
I am trying to launch a Spark EC2 cluster by running spark-ec2
--key-pair=key --identity-file=my.pem --vpc-id=myvpc --subnet-id=subnet-011
--spark-version=1.4.1 launch spark-cluster, but I keep getting the following
message endlessly. Please help.
Warning: SSH connection error.
Thanks Sean.
So how is PySpark supported? I thought PySpark needs JDK 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen so...@cloudera.com wrote:
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote:
I tried to build Spark 1.4.1 on cdh 5.4.0.
That was only true until Spark 1.3. Spark 1.4 can be built with JDK7
and pyspark will still work.
On Fri, Aug 21, 2015 at 8:29 AM, Chen Song chen.song...@gmail.com wrote:
Thanks Sean.
So how is PySpark supported? I thought PySpark needs JDK 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
PySpark, I used JDK 1.6.
I got the following error,
[INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first)
@ spark-streaming_2.10 ---
java.lang.UnsupportedClassVersionError:
Is there any reliable way to find out the number of executors
programmatically, regardless of how the job is run? A method that
preferably works for Spark standalone, YARN, and Mesos, regardless of whether
the code runs from the shell or not?
Things that I tried and don't work:
-
No, the message never ends. I have to ctrl-c out of it.
Garry
From: shahid ashraf [mailto:sha...@trialx.com]
Sent: Friday, August 21, 2015 11:13 AM
To: Garry Chen g...@cornell.edu
Cc: user@spark.apache.org
Subject: Re: Spark ec2 launch problem
Does the cluster work at the end ?
On Fri, Aug 21,
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote:
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
PySpark, I used JDK 1.6.
I got the following error,
[INFO] --- scala-maven-plugin:3.2.0:testCompile
Have you seen the Spark SQL paper?:
https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz
wysakowicz.da...@gmail.com wrote:
Hi,
thanks for answers. I have read answers you provided, but I rather look
for some materials on the
Does the cluster work at the end ?
On Fri, Aug 21, 2015 at 8:25 PM, Garry Chen g...@cornell.edu wrote:
Hi All,
I am trying to launch a Spark EC2 cluster by running
spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
--subnet-id=subnet-011 --spark-version=1.4.1
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data
locality. I expect the
Hi Naveen,
As I mentioned before, the code is private and therefore not accessible. Just
copy and use the snippet that I sent; copying it here again:
https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270
Hi Sateesh,
It is interesting to know how you determined that the DStream runs on
a single core. Did you mean receivers?
Coming back to your question, could you not start the disk I/O in a separate
thread, so that the scheduler can go ahead and assign other tasks?
On 21 Aug 2015 16:06, Sateesh
You could also rename them with names.
Unfortunately the API docs don't show an example of that:
https://spark.apache.org/docs/latest/api/R/index.html
On Thu, Aug 20, 2015 at 7:43 PM -0700, Sun, Rui rui@intel.com wrote:
Hi,
You can create a DataFrame using load.df() with a specified schema.
You can take a look at spark.streaming.concurrentJobs; by default it runs a
single job. If you set it to 2, then it can run 2 jobs in parallel. It's an
experimental flag, but go ahead and give it a try.
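For illustration only (the application name below is a placeholder), the flag would be passed at submit time like this:

```shell
# Experimental: let the streaming scheduler run two jobs concurrently
spark-submit --conf spark.streaming.concurrentJobs=2 your-streaming-app.jar
```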
On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:
Hi,
My scenario goes like this:
You can try adding a human-readable entry to the /etc/hosts file of the
worker machine, and then you can set SPARK_LOCAL_IP to this
hostname in that machine's spark-env.sh file.
On Aug 21, 2015 11:57 AM, saif.a.ell...@wellsfargo.com wrote:
Hi,
Is it possible in standalone to set
You've probably hit this bug:
https://issues.apache.org/jira/browse/SPARK-7180
It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to
false and see if it goes away.
On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com
wrote:
Hi,
I'm using spark
It may happen that the version of the spark-ec2 script you are using is buggy,
or sometimes AWS has problems provisioning machines.
On Aug 21, 2015 7:56 AM, Garry Chen g...@cornell.edu wrote:
Hi All,
I am trying to launch a Spark EC2 cluster by running
spark-ec2 --key-pair=key
That looks like you are choking your Kafka machine. Run top on the Kafka
machines and check the workload; it may be that you are spending too much
time on disk I/O, etc.
On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:
Sounds like that's happening consistently, not an
In the test job I am running on Spark 1.3.1 in our staging cluster, I can see
the following information on the application stage page:

Metric     Min     25th pct   Median    75th pct   Max
Duration   0 ms    1.1 min    1.5 min   1.7 min    3.4 min
GC Time    11 s    16 s       21 s      25 s       54 s

From the GC output log, I can see it is
Raghavendra,
Thanks for the quick reply! I don’t think I included enough information in my
question. I am hoping to get fields that are not directly part of the
aggregation. Imagine a dataframe representing website views with a userID,
datetime, and a webpage address. How could I find the
Did you try sorting it by datetime and doing a groupBy on the userID?
On Aug 21, 2015 12:47 PM, Nathan Skone nat...@skone.org wrote:
Raghavendra,
Thanks for the quick reply! I don’t think I included enough information in
my question. I am hoping to get fields that are not directly part of the
Hi Akhil,
I'm using spark 1.4.1.
The number of executors is not in the command line, and not in
getExecutorMemoryStatus
(I already mentioned that I tried that; it works in spark-shell but not when
executed via spark-submit). I tried looking at defaultParallelism too;
it's 112 (7 executors * 16 cores).
Which version of Spark are you using? There was a discussion about this
here:
http://apache-spark-user-list.1001560.n3.nabble.com/Determine-number-of-running-executors-td19453.html
Could you periodically (say, every 10 mins) run System.gc() on the driver?
The cleaning up of shuffles is tied to garbage collection.
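A sketch of that periodic pattern in plain Python threading, under the assumption that the callback triggers the driver-side GC (the helper name is made up; in Scala the body would simply call System.gc()):

```python
import threading

def run_periodically(fn, interval_s):
    """Invoke fn every interval_s seconds until the returned event is set."""
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns False on
        # timeout (keep looping) and True once stop is set (exit).
        while not stop.wait(interval_s):
            fn()  # e.g. trigger a driver-side System.gc()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling stop.set() ends the loop cleanly; a real driver would use an interval of 600 seconds per the suggestion above.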
On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma sharmagaura...@gmail.com
wrote:
Hi All,
I have a 24x7 running Streaming Process, which runs on 2 hour windowed
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!
(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
TungstenExchange
Did you try with Hadoop version 2.7.1? It is known that s3a, which is
available in 2.7, works really well with Parquet. They fixed a lot of issues
related to metadata reading there...
On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote:
@Cheng, Hao : Physical plans show that it
Thanks Reynold, that helps a lot. I'm glad you're involved with that Google
Doc community effort. I think it's because of that doc that the JEP's
wording and scope changed for the better since it originally got
introduced.
Marek
On Fri, Aug 21, 2015 at 11:18 AM, Reynold Xin r...@databricks.com
Impact,
You can group the data, then sort by timestamp and take the max to
select the oldest value.
On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote:
I am also looking for a way to achieve the reducebykey functionality on
data
frames. In my case I need to select one particular row
I believe this was caused by some network configuration on my machines. After
installing VirtualBox, some new network interfaces were installed on the
machines and the Akka software was binding to one of the VirtualBox interfaces
and not the interface that belonged to my Ethernet card. Once I
Hi,
Is it possible in standalone to set up worker ID names? to avoid the
worker-19248891237482379-ip..-port ??
Thanks,
Saif
I am also looking for a way to achieve the reducebykey functionality on data
frames. In my case I need to select one particular row (the oldest, based on
a timestamp column value) by key.
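In plain Python (not DataFrames; the field names below are just illustrative), the per-key selection being described reduces to keeping the row with the smallest timestamp for each key:

```python
def oldest_per_key(rows, key="userID", ts="datetime"):
    # Keep, for each key value, the row with the smallest timestamp.
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[ts] < best[k][ts]:
            best[k] = row
    return best

views = [
    {"userID": "a", "datetime": 3, "page": "/home"},
    {"userID": "a", "datetime": 1, "page": "/login"},
    {"userID": "b", "datetime": 2, "page": "/about"},
]
result = oldest_per_key(views)  # "/login" wins for user "a"
```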
Nathan,
I achieve this using rowNumber. Here is a Python DataFrame example:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber
yourOutputDF = (
    yourInputDF
    .withColumn("first", rowNumber()
Is there a workaround without updating Hadoop? Would really appreciate if
someone can explain what spark is trying to do here and what is an easy way
to turn this off. Thanks all!
On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
Did you try with hadoop
Following is a method that retrieves the list of executors registered to a
Spark context. It worked perfectly with spark-submit in standalone mode for my
project.

/**
 * A simplified method that just returns the current active/registered
 * executors, excluding the driver.
 * @param sc *
Hi,
An exception is thrown when using a HAVING clause with variance or stddev. It
works perfectly when using other aggregate functions (like
sum, count, min, max...).

SELECT SUM(1) AS `sum_number_of_records_ok` FROM
`some_db`.`some_table` `some_table`
GROUP BY 1 HAVING (STDDEV(1) > 0)

SELECT SUM(1) AS
When I send messages from a Kafka topic having three partitions,
Spark will listen for the messages when I use KafkaUtils.createStream or
createDirectStream with local[4].
Now I want to see whether Spark will create partitions when it receives
messages from Kafka using a DStream: how, where, and in which method?
Hi,
thanks for the answers. I have read the answers you provided, but I am rather
looking for some materials on the internals, e.g. how the optimizer works, how
a query is translated into RDD operations, etc. The API I am quite familiar
with.
A good starting point for me was: Spark DataFrames: Simple and Fast
Hi,
I was trying to programmatically specify a schema and apply it to an RDD of
Rows and save the resulting DataFrame as a Parquet file.
Here's what I did:
1. Created an RDD of Rows from RDD[Array[String]]:
val gameId = Long.valueOf(line(0))
val accountType = Long.valueOf(line(1))
val
The OP wants to understand what determines the size of the task code that is
shipped to each executor so it can run the task. I don't know the answer, but
I would be interested to know too.
Sent from my iPhone
On 21 Aug 2015, at 08:26, oubrik [via Apache Spark User List]
It seems like you want simultaneous processing of multiple jobs but, at the
same time, serialization of a few tasks within those jobs. I don't know how to
achieve that in Spark.
But why would you bother about the interleaved processing when the data
that is being aggregated in different jobs is per
Please try DataFrame.toJSON; it will give you an RDD of JSON strings.
At 2015-08-21 15:59:43, smagadi sudhindramag...@fico.com wrote:
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
age <= 19")
I need teenagers to be a JSON object rather than a simple row. How can we get
Yes, DSE 4.7
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East robin.e...@xense.co.uk wrote:
Not sure, never used dse - it’s part of DataStax Enterprise right?
On 21 Aug 2015, at 10:07, satish chandra j jsatishchan...@gmail.com
wrote:
HI Robin,
Yes, below mentioned
HI Robin,
Yes, it is DSE but issue is related to Spark only
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East robin.e...@xense.co.uk wrote:
Not sure, never used dse - it’s part of DataStax Enterprise right?
On 21 Aug 2015, at 10:07, satish chandra j jsatishchan...@gmail.com
teenagers.toJSON gives the JSON but it does not preserve the parent IDs,
meaning if the input was {"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"},"age":20}
val x = sqlContext.sql("SELECT name, address.city, address.state, age FROM
people WHERE age > 19 AND age <= 30").toJSON
x.collect().foreach(println)
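One workaround, sketched in plain Python (the flat row layout is assumed from the query above, and the helper name is made up), is to rebuild the nested address object yourself before serializing:

```python
import json

def renest(row):
    # Rebuild the nested "address" object that the flat projection lost.
    return {
        "name": row["name"],
        "address": {"city": row["city"], "state": row["state"]},
        "age": row["age"],
    }

flat = {"name": "Yin", "city": "Columbus", "state": "Ohio", "age": 20}
print(json.dumps(renest(flat)))
```

In Spark this mapping could run in a map over the rows; that placement is an assumption, not something the thread confirms.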
Hi All,
I have a 24x7 running streaming process, which runs on 2-hour windowed data.
The issue I am facing is that my worker machines are running OUT OF DISK space.
I checked, and the SHUFFLE FILES are not getting cleaned up.
Hi Satish,
I don't see where Spark supports -i, so I suspect it is provided by DSE. In
that case, it might be a bug in DSE.
On Fri, Aug 21, 2015 at 6:02 PM, satish chandra j jsatishchan...@gmail.com
wrote:
HI Robin,
Yes, it is DSE but issue is related to Spark only
Regards,
Satish Chandra
Hi
Getting the below error in Spark Streaming 1.3 while consuming from Kafka
using the direct Kafka stream. A few tasks are failing in each run.
What is the reason for / solution to this error?
15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in
stage 130.0 (TID 16332)
HI Robin,
Yes, the below-mentioned piece of code works fine in the Spark shell, but when
the same is placed in a script file and executed with -i <file name>, it
creates an empty RDD:
scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] =
The easiest option I found is to put the jars in the SPARK_CLASSPATH.
On 21 Aug 2015 06:20, Burak Yavuz brk...@gmail.com wrote:
If you would like to try using spark-csv, please use
`pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
You're missing a dependency.
Best,
Burak
On Thu, Aug 20, 2015
Hi,
I was trying to programmatically specify a schema and apply it to a RDD of
Rows and save the resulting DataFrame as a parquet file, but I got
java.lang.ClassCastException:
java.lang.String cannot be cast to java.lang.Long on the last step.
Here's what I did:
1. Created an RDD of Rows from
Does spark sql supports XML the same way as it supports json ?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-support-for-XML-tp24382.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Have you considered asking this question on
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
?
Cheers
On Thu, Aug 20, 2015 at 10:57 PM, Samya samya.ma...@amadeus.com wrote:
Hi All,
I need to write an RDD to Cassandra using the sparkCassandraConnector
from