Can you provide some more details:
1. How many partitions does the RDD have?
2. How big is the cluster?
On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote:
> Dear all,
>
> I want to equally divide an RDD partition into two partitions. That means
> the first half of the elements in the
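A rough sketch of one way to do this, assuming each partition fits in memory and rdd is a concrete RDD (e.g. an RDD[String]); names are illustrative:

import org.apache.spark.Partitioner

val n = rdd.partitions.length
// tag each element with its target partition:
// first half -> 2*pid, second half -> 2*pid + 1
val indexed = rdd.mapPartitionsWithIndex { (pid, iter) =>
  val items = iter.toArray   // assumes the partition fits in memory
  val half = (items.length + 1) / 2
  items.zipWithIndex.map { case (x, i) =>
    (pid * 2 + (if (i < half) 0 else 1), x)
  }.iterator
}
// route each tagged half to its own partition, then drop the tags
val split = indexed.partitionBy(new Partitioner {
  def numPartitions: Int = n * 2
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}).values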
Try --jars rather than --class to submit the jar.
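For example (a sketch; class and jar names are illustrative):

spark-submit --class com.example.Main \
  --jars /path/to/dependency.jar \
  target/myapp.jar

--class names the main class to run; --jars ships additional dependency jars to the driver and executors.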
On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch java...@gmail.com wrote:
NoClassDefFoundError differs from ClassNotFoundException: it
indicates an error while initializing that class, but the class is found on
the classpath. Please
Can you tell us more about your environment? I understand you are running it
on a single machine, but is the firewall enabled?
On Sun, Aug 16, 2015 at 5:47 AM, t4ng0 manvendra.tom...@gmail.com wrote:
Hi
I am new to Spark and trying to run a standalone application using
spark-submit. Whatever I could
Why are you expecting the footprint of a DataFrame to be lower when it contains
more information (RDD + schema)?
On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote:
Hi,
With the following code snippet, I cached the raw RDD (which is already in
memory, but just for illustration) and its
Can you explain which transformation is failing? Here's a simple example:
http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
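A minimal sketch with MLlib's Statistics.corr (sample values are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.1),
  Vectors.dense(3.0, 6.2)))
val corrMatrix = Statistics.corr(vectors) // Pearson correlation matrix by default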
On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com wrote:
I tried with an RDD[DenseVector], but RDDs are not transformable, so T+
Try setting --driver-class-path.
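For example (a sketch; paths and class names are illustrative):

spark-submit --driver-class-path /path/to/mysql-connector-java-5.1.34.jar \
  --class com.example.SaveToMysql target/myapp.jar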
On Wed, Jul 22, 2015 at 3:45 PM, roni roni.epi...@gmail.com wrote:
Hi All,
I have a cluster with spark 1.4.
I am trying to save data to MySQL but am getting this error:
Exception in thread "main" java.sql.SQLException: No suitable driver found
for
For me, it only works if I set --driver-class-path to the MySQL library.
On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang gavin@gmail.com wrote:
OK, I found what the problem was: it couldn't work with
mysql-connector-5.0.8.
I updated the connector version to 5.1.34 and it worked.
Can you share some sample data?
On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:
Hi,
I am trying to run LogisticRegressionWithSGD on an RDD of LabeledPoints
loaded using loadLibSVMFile:
val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
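For reference, a minimal end-to-end sketch (the input path is illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val logistic: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val model = LogisticRegressionWithSGD.train(logistic, 100) // 100 iterations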
Programmatically specifying a schema needs
import org.apache.spark.sql.types._
for StructType and StructField to resolve.
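A minimal sketch of a programmatic schema (field names and types are illustrative; assumes an existing sqlContext on Spark 1.3+):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
val df = sqlContext.createDataFrame(rowRDD, schema)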
On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:
Yes I think this was already just fixed by:
https://github.com/apache/spark/pull/4977
a .toDF() is
If you are only concerned about big partition sizes, you can specify the number
of partitions as an additional parameter while loading files from HDFS.
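For example (path and count are illustrative):

// the second argument is the minimum number of partitions
val rdd = sc.textFile("hdfs:///data/large-input", 200)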
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote:
You can also use your InputFormat/RecordReader in Spark, e.g. using
I am joining two tables as below; the program stalls at the log line below and
never proceeds.
What might be the issue, and what is a possible solution?
INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79
Table 1 has 450 columns.
Table 2 has 100 columns.
Both tables have a few million
You can also access the SparkConf using sc.getConf in the Spark shell, though for
StreamingContext you can directly refer to sc as Akhil suggested.
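For example, from the shell:

val conf = sc.getConf // a copy of the active configuration
println(conf.get("spark.master"))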
On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
In the shell you could do:
val ssc = new StreamingContext(sc, Seconds(1))
as
One approach is to first transform this RDD into a PairRDD, taking the
field you are going to aggregate on as the key.
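A sketch, assuming column 0 is the grouping key and column 2 is the numeric value:

val pairs = sc.textFile("data.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(2).toDouble))
val sums = pairs.reduceByKey(_ + _) // sum per key
// for averages, also count per key and divide
val avgs = pairs.mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }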
On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I have a csv file having fields as a,b,c .
I want to do aggregation(sum,average..) based
As per my understanding, RDDs do not get replicated; the underlying data does if
it's in HDFS.
On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I want to find the time taken for replicating an RDD in a Spark cluster,
along with the computation time on the
Hi Kevin,
Say A has 10 IDs; are you pulling data from B's data source only for
these 10 IDs?
What if you load A and B as separate SchemaRDDs and then do the join? Spark
will optimize the path anyway when an action is fired.
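A sketch with Spark 1.2-era SchemaRDDs (paths and column names are illustrative):

val a = sqlContext.jsonFile("a.json")
val b = sqlContext.jsonFile("b.json")
a.registerTempTable("A")
b.registerTempTable("B")
val joined = sqlContext.sql("SELECT A.id, B.value FROM A JOIN B ON A.id = B.id")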
On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote:
Hi,
Hi Ankit,
The optional number-of-partitions value is there to increase the number of
partitions, not to reduce it from the default value.
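To actually reduce the partition count after loading, coalesce is the usual route (a sketch):

val single = sc.textFile("input.txt").coalesce(1)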
On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
I am trying to read a file into a single partition, but it seems like
sparkContext.textFile
Without caching, each action triggers recomputation. So, assuming rdd2 and rdd3
result in separate actions, the answer is yes.
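A sketch of the cached variant (path is illustrative):

val rdd1 = sc.textFile("data.txt").cache()
val rdd2 = rdd1.map(_.length)
val rdd3 = rdd1.filter(_.nonEmpty)
rdd2.count() // computes rdd1 and populates the cache
rdd3.count() // reads rdd1 from the cache instead of recomputing it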
On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote:
If I have 2 RDDs which depend on the same RDD like the following:
val rdd1 = ...
val rdd2 = rdd1.groupBy()...
How big is your input dataset?
On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com
wrote:
Hi,
When I run the program below, I see two files in HDFS because the
number of partitions is 2. But one of the files is empty. Why is that? Is
the work not distributed
You can try (Scala version; you can convert it to Python):
val set = initial.groupBy(x => if (x == something) key1 else key2)
This would do one pass over the original data.
On Fri, Nov 28, 2014 at 8:21 AM, mrm ma...@skimlinks.com wrote:
Hi,
My question is:
I have multiple filter operations where I
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in
replacement.
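For example (paths are illustrative):

# conf lives outside the versioned install directory
ln -s /data/spark-conf /opt/spark-1.1.1/conf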
On Monday, November 24, 2014, riginos samarasrigi...@gmail.com wrote:
OK thank you very much for that!
On 23 Nov 2014 21:49, Denny Lee wrote:
How about using the fluent style of Scala programming?
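For example, chaining the steps directly (input path and fields are illustrative):

val counts = sc.textFile("events.txt")
  .filter(_.nonEmpty)
  .map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)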
On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com
wrote:
Let's say I have to apply a complex sequence of operations to a certain
RDD.
In order to make code more modular/readable, I would typically have
something like
If your data is in HDFS, you are reading it as textFile, and each file is
smaller than the block size, my understanding is that it would always have one
partition per file.
On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
wrote:
Would it make sense to read each file in as a separate
Yes, you can always specify a minimum number of partitions, and that would
force some parallelism (assuming you have enough cores).
On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
What if the window is 5 seconds and the file takes longer than 5
seconds to be
Please use the join syntax.
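For example (table and column names are illustrative):

val tmp2 = sqlContext.sql(
  "SELECT a.field1, b.field2 FROM table1 a JOIN table2 b ON a.id = b.id")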
On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos
franco.barrien...@exalitica.com wrote:
I have 2 tables in a Hive context, and I want to select one field from each
table where the IDs of the tables are equal. For example,
val tmp2 = sqlContext.sql(select
Simple:
scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m)
should work. Replace "mmdd" with the format fechau3m is in.
If you want to do it at case class level:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveContext is always a good idea
import
Did you create a SQLContext?
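A sketch for Spark 1.x shells where sqlContext is not already defined:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._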
On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary abhinav.chowd...@gmail.com
wrote:
I have the same requirement of passing a list of values to an IN clause; when I
try to do it,
I am getting the error below:
scala> val longList = Seq[Expression](a, b)
<console>:11: error: type
Works fine for me: Spark 1.1.0 in the REPL.
On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:
There is definitely a bug in the Accumulators code.
More specifically, the following code works as expected:
def main(args: Array[String]) {
val conf = new
Hi Tridib,
I changed SQLContext to HiveContext and it started working. These are the steps
I used:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address =
Write to HDFS and then get one file locally by using hdfs dfs -getmerge...
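For example (paths are illustrative):

hdfs dfs -getmerge /user/me/output merged.txt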
On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:
You can save to a local file. What are you trying, and what doesn't work?
You can output one file by repartitioning to 1 partition, but this is
probably
Please add the following three libraries to your classpath (see the spark-submit sketch after the list):
spark-streaming-twitter_2.10-1.0.0.jar
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar
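For example (a sketch; the application class and jar are illustrative):

spark-submit \
  --jars spark-streaming-twitter_2.10-1.0.0.jar,twitter4j-core-3.0.3.jar,twitter4j-stream-3.0.3.jar \
  --class com.example.TwitterApp your-app.jar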
On Thu, Aug 21, 2014 at 1:09 PM, danilopds danilob...@gmail.com wrote:
Hi!
I'm getting started with development in Spark Streaming, and I'm