Hello,
I'm having problems running a Spark Streaming job for more than a few hours
(3 or 4). It begins working OK, but it degrades until failure. Some of the
symptoms:
- Consumed memory and CPU keep getting higher and higher, and finally some
error is thrown (java.lang.Exception: Could not
I was running the Spark shell and SQL with the --jars option containing the paths
when I got my error. I am not sure what the correct way to add jars is. I
tried placing the jar inside the directory you mentioned but still get the error.
I will give the code you posted a try. Thanks.
And yet another way is to demultiplex at one point, which will yield separate
DStreams for each message type that you can then process in independent DAG
pipelines in the following way:
val messageType1DStream = mainDStream.filter(isType1(_)) // isType1/isType2: placeholder predicates for each message type
val messageType2DStream = mainDStream.filter(isType2(_))
From experience, I'd recommend using the dstream.foreachRDD method and
doing the filtering within that context. Extending the example of TD,
something like this:
dstream.foreachRDD { rdd =>
  rdd.cache()
  messageTypes.foreach { msgType =>
    val selection = rdd.filter(msgType.matches(_)) // msgType.matches is a placeholder predicate
    // ... process selection ...
  }
}
I am running the queries from spark-sql. I don't think it can communicate
with the Thrift server. Can you tell me how I should run the queries to make it
work?
Looks like a good option. BTW, v3.0 is around the corner.
http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html
Thanks
Hi,
I'm new with both Cassandra and Spark and am experimenting with what Spark
SQL can do as it will affect my Cassandra data model.
What I need is a model that can accept arbitrary fields, similar to
Postgres's Hstore. Right now, I'm trying out the map type in Cassandra but
I'm getting the
Oops - what does "more work" mean in a parallel programming paradigm, and does
it always translate into "inefficiency"?
Here are a few laws of physics in this space:
1. More Work, if done AT THE SAME time AND fully utilizing the cluster
resources, is a GOOD thing
2. More Work
Thanks Nick. I understand that we can configure the index by creating
the index with the mapping first. I thought it would be a good feature
to add to es-hadoop/es-spark, as we can have the full mapping
and code in a single place, especially for simple mappings on a
particular field. It
I used to hit this issue when my processing time exceeded the batch
duration. Here are a few workarounds:
- Use storage level MEMORY_AND_DISK
- Enable WAL and checkpointing
The above two will slow things down a little bit.
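For reference, here's a rough sketch of wiring those two up (paths, names and the
batch interval below are just illustrative, and this assumes a receiver-based stream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the write-ahead log
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // enable checkpointing to a reliable store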
If you want low latency, what you can try is:
- Use storage level as
Also, you can have each message type in a different topic (this needs to be arranged
upstream from your Spark Streaming app, i.e. in the publishing systems and the
messaging brokers), and then for each topic you can have a dedicated instance of
InputReceiverDStream which will be the start of a dedicated
Hi,
has this issue been resolved? I am currently running into similar problems.
I am using spark-1.3.0-bin-hadoop2.4 on Windows and Ubuntu. I have set up all
paths on my Windows machine in an identical manner to my Ubuntu server
(using Cygwin, so everything is somewhere under
Thank you very much for your suggestions, Ignacio!
I have posted my solution here:
http://stackoverflow.com/questions/29649904/save-spark-org-apache-spark-mllib-linalg-matrix-to-a-file/29671193#29671193
Best regards,
Florin
On Wed, Apr 15, 2015 at 5:28 PM, Ignacio Blasco elnopin...@gmail.com
It's JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/
I put that jar in /tmp on the driver/machine I’m running spark shell from.
Then I ran with ./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master
yarn-client
So I’m guessing that --jars doesn’t set the class path for the
Hi
As per JIRA this issue is resolved, but I am still facing this issue.
SPARK-2734 - DROP TABLE should also uncache table
Hi
How do I specify a custom input format in Spark and control isSplitable
between files?
I need to read a file from HDFS; the file format is custom, and the requirement is
that a file should not be split when an executor node gets that partition
of the input dir.
Can anyone share a sample in Java?
Thanks
Hello guys,
after upgrading spark to 1.3.0 (and performing necessary code changes) an
issue appeared making me unable to handle Date fields (java.sql.Date) with
Spark SQL module. An exception appears in the console when I try to execute
an SQL query on a DataFrame (see below).
When I tried to
...one additional note:
implementation of org.apache.spark.sql.columnar.IntColumnStats is IMHO
wrong. Small hint - what will be the resulting upper and lower values for a
column containing no data (an empty RDD, or null values in an Int column across the
whole RDD)?
Shouldn't they be null?
Hi,
I'm trying to train an SVM on the KDD2010 dataset (available from libsvm), but
I'm getting a java.lang.OutOfMemoryError: Java heap space error. The dataset
is really sparse and has around 8 million data points and 20 million
features. I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM).
I have a big data file, and I aim to create an index on the data. I want to
partition the data based on a user-defined function in Spark-GraphX (Scala).
Further, I want to keep track of the node on which a particular data partition
is sent and processed, so I can fetch the required data by accessing
How do you intend to fetch the required data - from within Spark, or using
an app / code / module outside Spark?
-Original Message-
From: mas [mailto:mas.ha...@gmail.com]
Sent: Thursday, April 16, 2015 4:08 PM
To: user@spark.apache.org
Subject: Data partitioning and node tracking in
Dear all,
Here is an issue that drives me mad. I wrote a UserDefinedType in order to be
able to store a custom type in a Parquet file. In my code I just create a
DataFrame with my custom data type and write it into a Parquet file. When I
run my code directly inside IDEA everything works like a
Thanks a lot for the reply. Indeed it is useful, but to be more precise, I have
3D data and want to index it using an octree. Thus I aim to build a two-level
indexing mechanism, i.e. first at the global level I want to partition and send
the data to the nodes, then at the node level I again want to use an octree to
This would be much, much faster if your set of IDs was simply a Set,
and you passed that to a filter() call that just filtered in the docs
that matched an ID in the set.
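For instance, a rough sketch of that approach in the shell (the document RDD and
the IDs below are made up):

// (docId, document text) pairs
val docs = sc.parallelize(Seq(("doc1", "text one"), ("doc2", "text two"), ("doc3", "text three")))

// Broadcast the (small) set of wanted IDs and filter, instead of joining two RDDs.
val wantedIds = sc.broadcast(Set("doc1", "doc3"))
val matching = docs.filter { case (id, _) => wantedIds.value.contains(id) }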
On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have a solution for
We run Spark on Mac and Linux but also need to run it on Windows 8.1 and
Windows Server. We ran into problems with the Scala 2.10 binary bundle for
Spark 1.3.0 but managed to get it working. However, on Mac/Linux, we are on
Scala 2.11.6 (we built Spark from the sources). On Windows, however
Does anybody have a solution for this?
From: Wang, Ningjun (LNG-NPV)
Sent: Tuesday, April 14, 2015 10:41 AM
To: user@spark.apache.org
Subject: How to join RDD keyValuePairs efficiently
I have an RDD that contains millions of Document objects. Each document has a
unique Id that is a string. I
Ningjun, to speed up your current design you can do the following:
1. Partition the large doc RDD based on a hash function on the key, i.e. the docid
2. Persist the large dataset in memory so it is available for subsequent queries
without reloading and repartitioning for every search query
3.
I want to use Spark functions/APIs to do this task. My basic purpose is to
index the data and divide and send it to multiple nodes. Then at the time
of accessing I want to reach the right node and data partition. I don't
have any clue how to do this.
Thanks,
On Thu, Apr 16, 2015 at 5:13 PM, Evo
Well you can use a [Key, Value] RDD and partition it based on a hash function on
the Key, and even specify a number of partitions (and hence cluster nodes).
This will a) index the data, b) divide it and send it to multiple nodes. Re
your last requirement - in a cluster programming
Well you can have a two-level index structure, still without any need for
physical cluster node awareness
Level 1 Index is the previously described partitioned [K,V] RDD – this gets you
to the value (RDD element) you need on the respective cluster node
Level 2 Index – it will be built
I have a data frame in which I load data from a hive table. And my issue is
that the data frame is missing the columns that I need to query.
For example:
val newdataset = dataset.where(dataset("label") === 1)
gives me an error like the following:
ERROR yarn.ApplicationMaster: User class threw
Try increasing your driver memory.
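For example, something along these lines when submitting (the values and names here
are only placeholders to adjust for your job and cluster):

./bin/spark-submit --driver-memory 8g --executor-memory 16g --class YourMainClass your-app.jar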
Thanks
Best Regards
On Thu, Apr 16, 2015 at 6:09 PM, sarath sarathkrishn...@gmail.com wrote:
Hi,
I'm trying to train an SVM on KDD2010 dataset (available from libsvm). But
I'm getting java.lang.OutOfMemoryError: Java heap space error. The
dataset
is
Hi All
I have an RDD which has 1 million keys, and each key is repeated for around
7000 values, so in total there will be around 1M*7K records in the RDD.
Each key is created from zipWithIndex, so keys start from 0 to M-1.
The problem with zipWithIndex is that it takes a Long for the key, which is 8 bytes. Can
I
Is it for spark?
On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can simply override the isSplitable method in your custom inputformat
class and make it return false.
Here's a sample code snippet:
Can you paste your complete code? Did you try repartitioning/increasing the level
of parallelism to speed up the processing? Since you have 16 cores, I'm
assuming your 400k records aren't bigger than a 10G dataset.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele
Hi,
Is there a document/link that describes the general configuration settings to
achieve maximum Spark Performance while running on CDH5? In our environment, we
did a lot of changes (and are still doing so) to get decent performance; otherwise our
6-node dev cluster with default configurations lags
You could try repartitioning your RDD using a custom partitioner
(HashPartitioner etc.) and caching the dataset in memory to speed up the
joins.
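A minimal sketch of that idea in the shell (data and partition count are made up):

import org.apache.spark.HashPartitioner

// Hypothetical data: (docId, document text) and (docId, label) pairs.
val docs = sc.parallelize(Seq(("doc1", "contents of doc1"), ("doc2", "contents of doc2")))
val labels = sc.parallelize(Seq(("doc1", 1), ("doc2", 0)))

// Co-partition both RDDs with the same partitioner and cache the large side,
// so repeated joins reuse the cached, already-partitioned documents.
val partitioner = new HashPartitioner(8)
val docsByKey = docs.partitionBy(partitioner).cache()
val joined = docsByKey.join(labels.partitionBy(partitioner))
joined.collect().foreach(println)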
Thanks
Best Regards
On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have an RDD that contains
Hi All, I have the below code where distinct is running for a long time.
blockingRdd is a combination of (Long, String) and it will have 400K
records.
JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(
    new Function<String, Iterable<Integer>>() {
      @Override
      public Iterable<Integer>
Open the driver UI and see which stage is taking time; you can look at whether
it's adding any GC time etc.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All I have below code whether distinct is running for more time.
blockingRdd is the
I already checked, and GC is taking 1 sec for each task. Is this too much?
If yes, how do I avoid it?
On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:
Open the driver ui and see which stage is taking time, you can look
whether its adding any GC time etc.
Thanks
Best
Hi Michael,
Good question! We checked 1.2 and found that it is also slow caching
the same flat Parquet file. Caching other file formats of the same
data was faster by up to a factor of ~2. Note that the Parquet file
was created in Impala but the other formats were written by Spark SQL.
Cheers,
Well, there are a number of performance tuning guidelines in dedicated
sections of the Spark documentation - have you read and applied them?
Secondly, any performance problem within a distributed cluster environment
has two aspects:
1. Infrastructure
2. App Algorithms
You
You can simply override the isSplitable method in your custom inputformat
class and make it return false.
Here's a sample code snippet:
http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header
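For illustration, a minimal Scala sketch of such an input format, assuming the new
mapreduce API (the class name is made up):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Never split input files: each file is read in full by a single task.
class NonSplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}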
Thanks
Best Regards
On Thu, Apr 16, 2015 at 4:18 PM,
You can plug in the native Hadoop input formats with Spark's
sc.newAPIHadoopFile etc., which takes in the input format.
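Roughly like this in the shell (TextInputFormat just stands in for whatever custom
format you end up with, and the path is made up):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read with an explicit (new API) input format class; swap in your own format here.
val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
val lines = raw.map { case (_, text) => text.toString }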
Thanks
Best Regards
On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Is it for spark?
On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das
Michael, what exactly do you mean by flattened version/structure here, e.g.:
1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects
This question is in
Thanks Evo. Yes, my concern is only regarding the infrastructure
configurations. Basically, configuring YARN (NodeManager) + Spark is a must, and the
default settings never work. And what really happens is we make changes as and
when an issue is faced because of one of the numerous default
Can you share code that can reproduce the problem?
On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Hi
As per JIRA this issue is resolved, but I am still facing this issue.
SPARK-2734 - DROP TABLE should also uncache table
Bummer - out of curiosity, if you were to use the classpath.first or
perhaps copy the jar to the slaves could that actually do the trick? The
latter isn't really all that efficient but just curious if that could do
the trick.
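One more thing that might be worth trying - just a guess, not a confirmed fix for
this thread - is to also put the jar on the driver classpath in addition to --jars:

./bin/spark-shell --master yarn-client --jars /tmp/jtds-1.3.1.jar --driver-class-path /tmp/jtds-1.3.1.jar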
On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:
I take it back. My solution only works when you set the master to local. I
get the same error when I try to run it on the cluster.
Hi Sparkers,
hoping for insight here:
running a simple describe mytable here where mytable is a partitioned Hive
table.
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02,
SQL query: 72.831, Reading results: 0.189
Whereas Hive over the
Essentially, to change the performance yield of a software cluster
infrastructure platform like Spark, you play with different permutations of:
- Number of CPU cores used by Spark Executors on every cluster node
- Amount of RAM allocated for each executor
How disks and
I am using Spark Streaming where during each micro-batch I output data to
S3 using
saveAsTextFile. Right now each batch of data is put into its own directory
containing
2 objects, _SUCCESS and part-0.
How do I output each batch into a common directory?
Thanks,
Vadim
Thanks Sean. I want to load each batch into Redshift. What's the best/most
efficient way to do that?
Vadim
On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:
You can't, since that's how it's designed to work. Batches are saved
in different files, which are really directories
The reason for this is as follows:
1. You are saving data on HDFS
2. HDFS as a cluster/server side Service has a Single Writer / Multiple
Reader multithreading model
3. Hence each thread of execution in Spark has to write to a separate
file in HDFS
4. Moreover the
Nope Sir, it is possible - check my reply earlier
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, April 16, 2015 6:35 PM
To: Vadim Bichutskiy
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile
You can't, since that's how it's designed to work. Batches
Basically you need to unbundle the elements of the RDD and then store them
wherever you want - use foreachPartition and then foreach.
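Something along these lines (a sketch only; storeRecord is a made-up stand-in for
your real sink, e.g. an S3 client or a Redshift loader):

import org.apache.spark.streaming.dstream.DStream

// Made-up sink; replace with whatever actually persists a record.
def storeRecord(record: String): Unit = println(record)

def writeOut(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(storeRecord) // in a real sink, open one connection per partition
    }
  }
}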
-Original Message-
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:39 PM
To: Sean Owen
Cc:
Just copy the files? It shouldn't matter that much where they are, as
you can find them easily. Or consider somehow sending the batches of
data straight into Redshift? No idea how that is done, but I imagine
it's doable.
On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com
Also, to juggle the multithreading model of both Spark and HDFS even further, you
can publish the data from Spark first to a message broker, e.g. Kafka, from
where a predetermined number (from 1 to infinity) of parallel consumers will
retrieve it and store it in HDFS in one or more finely controlled
Copy should be doable, but I'm not sure how to specify a prefix for the
directory while keeping the filename (i.e. part-0) fixed in the copy command.
On Apr 16, 2015, at 1:51 PM, Sean Owen so...@cloudera.com wrote:
Just copy the files? it shouldn't matter that much where they are as
you can
Hi everyone,
I have a large RDD and I am trying to create an RDD of a random sample of
pairs of elements from this RDD. The elements composing a pair should come
from the same partition for efficiency. The idea I've come up with is to
take two random samples and then use zipPartitions to pair each
I don't think there's anything specific to CDH that you need to know,
other than it ought to set things up sanely for you.
Sandy did a couple posts about tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
No, I did not try the partitioning; below is the full code:
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
    JavaSparkContext jsc) throws IOException {
  long start = System.currentTimeMillis();
  JavaPairRDD<Long, MatcherReleventData> RddForMarch =
      matchRdd.zipWithIndex().mapToPair(new
You can't, since that's how it's designed to work. Batches are saved
in different files, which are really directories containing
partitions, as is common in Hadoop. You can move them later, or just
read them where they are.
On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to
Windows. AFAIK it should build on Windows too, the only problem is that Maven
might take a long time to download dependencies. What errors are you seeing?
Matei
On Apr 16, 2015, at 9:23 AM, Arun Lists
Thanks Evo for your detailed explanation.
On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti...@isecc.com wrote:
The reason for this is as follows:
1. You are saving data on HDFS
2. HDFS as a cluster/server side Service has a Single Writer / Multiple
Reader multithreading
On Thu, Apr 16, 2015 at 12:16:13PM -0700, Marcelo Vanzin wrote:
I think Michael is referring to this:
Exception in thread "main" java.lang.IllegalArgumentException: You
must specify at least 1 executor!
Usage: org.apache.spark.deploy.yarn.Client [options]
Yes, sorry, there were too many mins
Oh, just noticed that I missed the attachment... Yeah, your scripts will be
helpful. Thanks!
On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
Yes, i am able to reproduce the problem. Do you need the scripts to create
the tables?
On Thu, Apr 16, 2015 at 10:50 PM,
I have some trouble controlling the number of Spark tasks for a stage. This is on
the latest Spark 1.3.x source code build.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sc.getConf.get("spark.default.parallelism") // set to 10
val t1 = hiveContext.sql("FROM SalesJan2009 select *")
val t2 =
I think Michael is referring to this:
Exception in thread "main" java.lang.IllegalArgumentException: You
must specify at least 1 executor!
Usage: org.apache.spark.deploy.yarn.Client [options]
spark-submit --conf spark.dynamicAllocation.enabled=true --conf
spark.dynamicAllocation.minExecutors=0
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote:
IIRC that was fixed already in 1.3
https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b
From that commit:
+ private val minNumExecutors =
conf.getInt("spark.dynamicAllocation.minExecutors", 0)
...
+ if
Thanks, Matei! We'll try that and let you know if it works. You are correct
in inferring that some of the problems we had were with dependencies.
We also had problems with the spark-submit scripts. I will get the details
from the engineer who worked on the Windows builds and provide them to you.
Empty folders generally mean that you just need to increase the window
intervals; i.e. Spark Streaming's
saveAsTextFiles will save folders for each interval regardless.
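If the empty folders themselves are the nuisance, one common workaround (a sketch;
dstream is your stream and the output path is made up) is to skip empty batches and
write per-batch directories yourself:

dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
  }
}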
On Wed, Apr 15, 2015 at 5:03 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Its printing on console but on HDFS all
If it fails with sbt-assembly but not without it, then there's always the
likelihood of a classpath issue. What dependencies are you rolling up into
your assembly jar?
On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Any ideas ?
On Thu, Apr 16, 2015 at 5:04 PM,
On Thu, Apr 16, 2015 at 08:10:54PM +0100, Sean Owen wrote:
Yes, look what it was before -- would also reject a minimum of 0.
That's the case you are hitting. 0 is a fine minimum.
How can 0 be a fine minimum if it's rejected? Changing the value is easy
enough, but in general it's nice for
Akhil, any thought on this?
On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote:
No, I did not try the partitioning; below is the full code:
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
    JavaSparkContext jsc) throws IOException {
  long start =
Never mind. I found the solution:
val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
hiveLoadedDataFrame.schema)
which translates to converting the data frame to an RDD and back again to a data
frame. Not the prettiest solution, but at least it solves my problems.
Thanks,
Cesar Flores
On
Any ideas?
On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Dear all,
Here is an issue that drives me mad. I wrote a UserDefinedType in order to be
able to store a custom type in a Parquet file. In my code I just create a
DataFrame with my custom data type and write
Does this same functionality exist with Java?
On 17 April 2015 at 02:23, Evo Eftimov evo.efti...@isecc.com wrote:
You can use
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
Return a copy of the RDD partitioned using the specified partitioner
The
I believe it is a generalization of some classes inside GraphX, where there
was/is a need to keep stuff indexed for random access within each RDD
partition.
On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Can somebody from Data Briks sched more light on this Indexed RDD
Filed: https://issues.apache.org/jira/browse/SPARK-6967
Shouldn't they be null?
Statistics are only used to eliminate partitions that can't possibly hold
matching values. So while you are right that this might result in a false
positive, it will not result in a wrong answer.
Yes, look what it was before -- would also reject a minimum of 0.
That's the case you are hitting. 0 is a fine minimum.
On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone mst...@mathom.us wrote:
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote:
IIRC that was fixed already in 1.3
Looks like that message would be triggered if
spark.dynamicAllocation.initialExecutors was not set, or 0, if I read
this right. Yeah, that might have to be positive. This requires you to
set initial executors to 1 if you want 0 min executors. Hm, maybe that
shouldn't be an error condition in the args
For a Map type column, fields['driver'] is the syntax to retrieve the map
value (in the schema, you can see fields: map). The syntax
fields.driver is used for struct types.
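For example (the table and column names below are made up):

// fields declared as a map type:
sqlContext.sql("SELECT fields['driver'] FROM readings")
// fields declared as a struct with a driver member:
sqlContext.sql("SELECT fields.driver FROM readings")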
On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco jc.francisc...@gmail.com
wrote:
Hi,
I'm new with both Cassandra and Spark
Hi Guillaume,
Interesting that you brought up Shuffle. In fact we are experiencing this
issue of shuffle files being left behind and not being cleaned up. Since
this is a Spark streaming application, it is expected to stay up
indefinitely, so shuffle files being left is a big problem right now.
Yes, i am able to reproduce the problem. Do you need the scripts to create
the tables?
On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai yh...@databricks.com wrote:
Can you share code that can reproduce the problem?
On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
Hi
Hi,
I have data in my Elasticsearch server. When I query it using the REST
interface, I get results and a score for each result, but when I run the
same query in Spark using the Elasticsearch API, I get results and
metadata, but the score is shown as 0 for each record.
My configuration is
...
val conf = new
You can use:
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
which returns a copy of the RDD partitioned using the specified partitioner.
The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool and is
something which adds valuable functionality to Spark, e.g. the point lookups.
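As a small sketch of the partitionBy route in the shell (data and partition count
are illustrative):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize((1L to 1000L).map(i => (i, s"value-$i")))
  .partitionBy(new HashPartitioner(8))
  .cache()

// lookup() uses the partitioner, so only the one relevant partition is scanned.
pairs.lookup(42L)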
Can you please guide me on how I can extend RDD and convert it in the way you
are suggesting?
On 16 April 2015 at 23:46, Jeetendra Gangele gangele...@gmail.com wrote:
In type T I already have Object ... I have RDD<Object> and then I am
calling zipWithIndex on this RDD and getting RDD<Object, Long> on
Can somebody from Databricks shed more light on this IndexedRDD library?
https://github.com/amplab/spark-indexedrdd
It seems to come from AMPLab, and most of the Databricks guys are from
there.
What is especially interesting is whether the Point Lookup (and the other
primitives) can work
Thanks, but we need a firm statement, preferably from somebody from the Spark
vendor Databricks, including an answer to the specific question posed by me and an
assessment/confirmation of whether this is a production-ready / quality library
which can be used for general-purpose RDDs, not just inside
I have a big dataset of categories of cars and descriptions of cars. So I
want to give a description of a car and have the program classify the category
of that car.
So I decided to use multinomial naive Bayes. I created a unique id for each
word and replaced my whole category/description data.
Evo
partition the large doc RDD based on the hash function on the key ie
the docid
What API do I use to do this?
By the way, loading the entire dataset into memory causes an OutOfMemory problem
because it is too large (I only have one machine with 16GB and 4 cores).
I found something called
Here is the list of my dependencies:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "org.iq80.leveldb" % "leveldb" % "0.7",
  "com.github.fommil.netlib"
Yes, simply look for partitionBy in the Javadoc, e.g. for JavaPairRDD.
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Thursday, April 16, 2015 9:57 PM
To: Evo Eftimov
Cc: Wang, Ningjun (LNG-NPV); user
Subject: Re: How to join RDD keyValuePairs efficiently
Does this same
Hi,
how do I convert this script to Java 8 with lambdas?
My problem is that the function getactivations() returns a
scala.collection.Iterator<Node> to
mapPartitions(), which needs a java.util.Iterator<String> ...
Thanks!
===
// Step 1 - Stub code to copy
Hi Aurelien,
Sean's solution is nice, but maybe not completely order-free, since
pairs will come from the same partition.
The easiest / fastest way to do it in my opinion is to use a random key
instead of a zipWithIndex. Of course you'll not be able to ensure
uniqueness of each element of
I'm the primary author of IndexedRDD. To answer your questions:
1. Operations on an IndexedRDD partition can only be performed from a task
operating on that partition, since doing otherwise would require
decentralized coordination between workers, which is difficult in Spark. If
you want to
Can you tell us how you created the dataframe?
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 17, 2015 2:52 AM
To: rkrist
Cc: user
Subject: Re: ClassCastException processing date fields using spark SQL since
1.3.0
Filed:
Hello,
We wanted to tune Spark running on a YARN cluster. The Spark History
Server UI shows lots of parameters like:
- GC time
- Task Duration
- Shuffle R/W
- Shuffle Spill (Memory/Disk)
- Serialization Time (Task/Result)
- Scheduler Delay
Among the above metrics, which are
The Hadoop support from Hortonworks only *actually* works with Windows
Server - well, at least as of Spark Summit last year - and AFAIK that has
not changed since.
2015-04-16 15:18 GMT-07:00 Dean Wampler deanwamp...@gmail.com:
If you're running Hadoop, too, now that Hortonworks supports Spark,