Hi Jon,
In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used
interchangeably. If you are trying to collect multiple batches across a
DStream into a single RDD, look at the window() operations.
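A minimal sketch of that, assuming illustrative window and slide durations that are multiples of the batch interval (names here are hypothetical):
import org.apache.spark.streaming.Seconds
// Each RDD emitted by the windowed stream covers the last 60 seconds of batches.
val windowed = dStream.window(Seconds(60), Seconds(60))
windowed.foreachRDD { rdd =>
  // rdd now holds the union of the RDDs from the covered batches
}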
Hope this helps
Nikunj
On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase jon.ch...@gmail.com
From log file:
15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Creating
symlink: /tmp/hadoop-root/mapred/local/1437016615898/user_agents ->
/opt/HiBench-master/user_agents
15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Localized
I struggled a lot with Scala, almost 10 days with no improvement, but when I switched to
Java 8 things were so smooth, and I used DataFrames with Redshift and Hive and
all are looking good. If you are very good in Scala then go with Scala; otherwise
Java is the best fit.
This is just my opinion because I am
Hi Ted, thanks for your advice. I found that there is something wrong with the
hadoop fs -get command, because I believe the localization of
hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to
/tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like
hadoop fs -get
I should note that the amount of data in each batch is very small, so I'm
not concerned with performance implications of grouping into a single RDD.
On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase jon.ch...@gmail.com wrote:
I'm currently doing something like this in my Spark Streaming program
Have you ever tried querying “select * from temp_table” from the spark shell? Or
can you try the --jars option when starting the spark shell?
From: Srikanth [mailto:srikanth...@gmail.com]
Sent: Thursday, July 16, 2015 9:36 AM
To: user
Subject: Re: HiveThriftServer2.startWithContext error with
Thanks a lot :)
On Wed, Jul 15, 2015 at 11:48 PM Marcelo Vanzin van...@cloudera.com wrote:
That has never been the correct way to set your app's classpath.
Instead, look at http://spark.apache.org/docs/latest/configuration.html
and search for extraClassPath.
On Wed, Jul 15, 2015 at 9:43 AM,
You can try selectExpr() on a DataFrame. For example,
y <- selectExpr(df, "concat(hashtags.text[0], hashtags.text[1])") # the []
operator is used to extract an item from an array
or
sql(hiveContext, "select concat(hashtags.text[0], hashtags.text[1]) from table")
Yeah, the documentation of SparkR is not
Looks like this method should serve Jon's needs:
def reduceByWindow(
reduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
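A minimal usage sketch of this method, assuming a DStream[Long] of per-record values and illustrative durations:
import org.apache.spark.streaming.Seconds
// Sums the values of all batches that fall inside each 60-second window.
val windowedSum = longStream.reduceByWindow(_ + _, Seconds(60), Seconds(60))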
On Wed, Jul 15, 2015 at 8:23 PM, N B nb.nos...@gmail.com wrote:
Hi Jon,
In Spark streaming, 1 batch = 1 RDD. Essentially, the
Hello,
I have a list of RDDs
List(rdd1, rdd2, rdd3, rdd4)
I would like to save these RDDs in parallel. Right now, it is running each
operation sequentially. I tried using an RDD of RDDs but that does not work.
list.foreach { rdd =>
  rdd.saveAsTextFile("/tmp/cache/")
}
Any ideas?
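One hedged sketch of a common workaround, assuming distinct output paths per RDD: trigger each saveAsTextFile from its own thread (here via Scala futures), so the Spark scheduler can run the resulting jobs concurrently.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
// Kick off all saves concurrently; each Future submits one Spark job.
val saves = list.zipWithIndex.map { case (rdd, i) =>
  Future { rdd.saveAsTextFile(s"/tmp/cache/part-$i") }
}
// Block until every save has finished.
Await.result(Future.sequence(saves), Duration.Inf)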
I'm currently doing something like this in my Spark Streaming program
(Java):
dStream.foreachRDD((rdd, batchTime) -> {
  log.info("processing RDD from batch {}", batchTime);
  // my rdd processing code
});
Instead of having my
Hi, Hunter,
What behavior do you see with HDFS? The local file system and HDFS
should have the same behavior.
Thanks!
- Terry
Hunter Morgan hunter.mor...@rackspace.com wrote on Thursday, July 16, 2015 at 2:04 AM:
After moving the setting of the parameter to SparkConf initialization
instead of
Actually it's supposed to be part of Spark 1.5 release, see
https://issues.apache.org/jira/browse/SPARK-8230
You're definitely welcome to contribute to it, let me know if you have any
question on implementing it.
Cheng Hao
-Original Message-
From: pedro
Hello,
Re-sending this to see if I'm second time lucky!
I've not managed to move past this error.
Srikanth
On Mon, Jul 13, 2015 at 9:14 PM, Srikanth srikanth...@gmail.com wrote:
Hello,
I want to expose result of Spark computation to external tools. I plan to
do this with Thrift server JDBC
Thanks, that fixed the problem.
Cheers
Rishi
The latest version of IndexedRDD supports any key type with a defined
serializer
https://github.com/amplab/spark-indexedrdd/blob/master/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/KeySerializer.scala,
including Strings. It's not released yet, but you can use it from the
master branch if
Hi All,
I am trying to add a few new rows to an existing table in MySQL using a
DataFrame. It adds the new rows to the table in the local environment, but on the
Spark cluster I get the runtime exception below.
Exception in thread main java.lang.RuntimeException: Table msusers_1
already exists.
at
I found out the problem. Grouping by a constant column value is indeed
impossible.
The reason it was working in my project is that I gave the constant
column an alias that exists in the schema of the dataframe. The dataframe
contained a data_timestamp representing an hour, and I added to the
hi all,
Not sure whether this is the right venue to ask. If not, please point me to the
right group, if there is any.
I'm trying to create a Spark DataFrame from JSON file using jsonFile(). The
call was successful, and I can see the DataFrame created. The JSON file I
have contains a number of
With regard to indexed structures in Spark, are there any alternatives to
IndexedRDD for more generic keys, including Strings?
Thanks
Jem
On Wed, Jul 15, 2015 at 7:41 AM Burak Yavuz brk...@gmail.com wrote:
Hi Swetha,
IndexedRDD is available as a package on Spark Packages
We have a complex application that has been running in production for a couple
of months and heavily uses Spark in Scala.
Just to give you some insight on complexity - we do not have such a huge
source data (only about 500'000 complex elements), but we have more than
a billion transformations and
This is very interesting, do you know if this version will be backwards
compatible with older versions of Spark (1.2.0)?
Thanks,
Jem
On Wed, Jul 15, 2015 at 10:04 AM Ankur Dave ankurd...@gmail.com wrote:
The latest version of IndexedRDD supports any key type with a defined
serializer
I see in StructField that we can provide metadata.
What is it meant for? How is it used by Spark later on?
Are there any rules on what we can/cannot do with it?
I'm building some DataFrame processing, and I need to maintain a set of
(meta)data along with the DF. I was wondering if I can use
Why would you create a class and then instantiate it to store data and
change the class every time you have to add a new element? In OOPS
terminology a class represents an object, and an object has states - does
it not?
Purely from a data warehousing perspective - one of the fundamental
For the RandomForest classifier, labels should be within the range
[0, numClasses-1]. This means you have to map your labels to 0,1 instead of
1,2.
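A minimal Scala sketch of that remapping, assuming an RDD[LabeledPoint] with labels currently encoded as 1 and 2:
import org.apache.spark.mllib.regression.LabeledPoint
// Shift labels 1,2 down to 0,1 so they fall within [0, numClasses-1].
val remapped = data.map(lp => LabeledPoint(lp.label - 1.0, lp.features))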
Looks like the source of the problem is in the SqlNewHadoopRDD.compute method:
the created Parquet file reader is intended to be closed at task completion
time.
This reader contains a lot of references to parquet.bytes.BytesInput objects,
which in turn contain references to large byte arrays (some of
Hi,
I am evaluating my options for a project that ingests a rich data feed,
does some aggregate calculations and allows the user to query on these.
The (protobuf) data feed is rich in the sense that it contains several data
fields which can be used to calculate several different KPI figures.
Hi,
I ran into some interesting problems with the --jars option.
Since I use a third-party dependency, elasticsearch-spark, I pass this jar
with the following command:
./bin/spark-submit --jars path-to-dependencies ...
It works well.
However, if I use HiveContext.sql, Spark will lose the dependencies
The leftOuterJoin and join APIs are super slow in Spark, 100x slower than Hadoop.
Sent from my iPhone
On 14-Jul-2015, at 10:59 PM, Wush Wu wush...@gmail.com wrote:
I don't understand.
By the way, the `joinWithCassandraTable` does improve my query time
from 40 mins to 3 mins.
2015-07-15
Hi,
I believe the HiveContext uses a different class loader. It then falls back
to the system class loader if it can't find the classes in the context
class loader. The system class loader contains the classpath passed
through --driver-class-path
and spark.executor.extraClassPath. The JVM is
I think any request going to s3*:// requires credentials. If the data has been
made public (via http) then you won't require the keys.
Thanks
Best Regards
On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto rpagli...@appcomsci.com
wrote:
Hi Sujit,
I just wanted to access public datasets on
Hi,
I'm facing a bug with GROUP BY in Spark SQL (version 1.4).
I registered a JavaRDD of objects containing integer fields as a table.
Then I'm trying to do a group by, with a constant value in the group by
fields:
SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
Hi,
Is this in LibSVM format? If so, the indices should be sorted in increasing
order. It seems like they are not sorted.
Best,
Burak
On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van ngovi.se@gmail.com wrote:
Hi All,
I've hit an issue with MLlib when I use LogisticRegressionWithLBFGS
my
Hi Daniel
Well said
Regards
Vineel
On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Hi Shahid,
To be honest I think this question is better suited for Stack Overflow
than for a PhD thesis.
On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf
I think different teams get different answers to this question. My team uses
Scala, and is happy with it.
On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers tris...@blackfrog.org
wrote:
We have had excellent results operating on RDDs using Java 8 with Lambdas.
It’s slightly more verbose than Scala,
This is LibSVM format. I can use this data with the libsvm library.
In this sample, the indices are not sorted. I will sort them and try again.
Thank you,
On Wed, Jul 15, 2015 at 1:47 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
Is this in LibSVM format? If so, the indices should be sorted in
Try repartitioning it to a higher number of partitions (at least 3-4 times the
total number of CPU cores). What operation are you doing? If you are doing a
join/groupBy sort of operation, it may happen that the task which is taking a
long time has all the values; in that case you need to use a Partitioner which
will
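A minimal sketch of both suggestions, with the core count and partitioner choice as illustrative assumptions:
import org.apache.spark.HashPartitioner
val numCores = 16                                  // assumed total number of CPU cores
val repartitioned = rdd.repartition(numCores * 4)  // 3-4x the core count
// For a join/groupBy, an explicit partitioner helps spread the keys across tasks.
val grouped = pairRdd.groupByKey(new HashPartitioner(numCores * 4))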
Hi TD,
Requesting your guidance on the 5 queries below. Following is the context for
them, which I would use to evaluate based on your response.
a) I need to decide whether to deploy Spark in Standalone mode or in Yarn.
But it seems to me that Spark in Yarn is more parallel than Standalone mode
(given
Hi
I am trying to train a Random Forest over my dataset. I have a binary
classification problem. When I call the train method as below
model = RandomForest.trainClassifier(data, numClasses=2,
    categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto",
    impurity='gini', maxDepth=4,
Hi Swetha,
IndexedRDD is available as a package on Spark Packages
http://spark-packages.org/package/amplab/spark-indexedrdd.
Best,
Burak
On Tue, Jul 14, 2015 at 5:23 PM, swetha swethakasire...@gmail.com wrote:
Hi Ankur,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in
The main advantage of using scala vs java 8 is being able to use a console
2015-07-15 9:27 GMT+02:00 诺铁 noty...@gmail.com:
I think different team got different answer for this question. my team
use scala, and happy with it.
On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers
On 15 Jul 2015, at 17:38, Cody Koeninger c...@koeninger.org wrote:
An in-memory hash key data structure of some kind so that you're close to
linear on the number of items in a batch, not the number of outstanding keys.
That's more complex, because you have to deal with expiration for keys
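A minimal sketch of that kind of structure, assuming String keys and a per-key last-seen timestamp (all names here are hypothetical):
import scala.collection.mutable
class ExpiringKeyStore(ttlMs: Long) {
  private val lastSeen = mutable.HashMap.empty[String, Long]
  // Record that a key was seen in the current batch.
  def touch(key: String, now: Long = System.currentTimeMillis()): Unit =
    lastSeen(key) = now
  // Drop every key that has not been seen within the TTL.
  def expire(now: Long = System.currentTimeMillis()): Unit =
    lastSeen.retain((_, ts) => now - ts <= ttlMs)
  def size: Int = lastSeen.size
}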
After moving the setting of the parameter to SparkConf initialization instead
of after the context is already initialized, I have it operating reliably on
local filesystem, but not on hdfs. Are there any differences in behavior
between these two cases I should be aware of?
I don’t usually
Dear Sujit,
Thanks for your suggestion.
After testing, the `joinWithCassandraTable` does the trick like what
you mentioned.
The rdd2 only query those data which have the same key in rdd1.
Best,
Wush
2015-07-16 0:00 GMT+08:00 Sujit Pal sujitatgt...@gmail.com:
Hi Wush,
One option may be to
Yes there is.
But the RDD is more than 10 TB and compression does not help.
On Wed, Jul 15, 2015 at 8:36 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. serializeUncompressed()
Is there a method which enables compression ?
Just wondering if that would reduce the memory footprint.
Cheers
On
Hi Cody,
I’ve had success using updateStateByKey for real-time sessionization by aging
off timed-out sessions (returning None in the update function). This was on a
large commercial website with millions of hits per day. This was over a year
ago so I don’t have access to the stats any longer
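A minimal sketch of that aging pattern, assuming a hypothetical Session state, a DStream keyed by user id with event timestamps as values, and an illustrative 30-minute inactivity timeout:
case class Session(lastSeen: Long, hits: Long)
val timeoutMs = 30 * 60 * 1000L  // assumed inactivity timeout
def updateSession(events: Seq[Long], state: Option[Session]): Option[Session] = {
  val now = System.currentTimeMillis()
  if (events.nonEmpty)
    Some(Session(now, state.map(_.hits).getOrElse(0L) + events.size))
  else if (state.exists(s => now - s.lastSeen > timeoutMs))
    None  // returning None removes the key, i.e. the session is aged off
  else
    state
}
val sessions = hitsByUser.updateStateByKey(updateSession _)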
Sorry Guys!
I mistakenly added my question to this thread (Research ideas using spark).
Moreover, people can ask any question; this Spark user group is for that.
Cheers!
On Wed, Jul 15, 2015 at 9:43 PM, Robin East robin.e...@xense.co.uk wrote:
Well said Will. I would add that you might want
I would just like to add that we do the very same/similar thing at Webtrends;
updateStateByKey has been a life-saver for our sessionization use cases.
Cheers,
Sean
On Jul 15, 2015, at 11:20 AM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Hi Cody,
I am trying to analyze my program, in particular to see what the bottleneck
is (IO, CPU, network), and started using the event timeline for this.
When looking at my Job 0, Stage 0 (the sampler function taking up 5.6
minutes of my 40-minute program), I see in the event timeline that all time
is
That has never been the correct way to set your app's classpath.
Instead, look at http://spark.apache.org/docs/latest/configuration.html and
search for extraClassPath.
On Wed, Jul 15, 2015 at 9:43 AM, lokeshkumar lok...@dataken.net wrote:
Hi forum
I have downloaded the latest spark version
Look at this :
http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/
On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf sha...@trialx.com wrote:
Sorry Guys!
I mistakenly added my question to this thread( Research ideas
I am working on Scala code which performs linear regression on certain
datasets. Right now I am using 20 cores and 25 executors, and every time I run
a Spark job I get a different result.
The input sizes of the files are 2 GB and 400 MB. However, when I run the job
with 20 cores and 1 executor, I
Don't get me wrong, we've been able to use updateStateByKey for some jobs,
and it's certainly convenient. At a certain point though, iterating
through every key on every batch is a less viable solution.
On Wed, Jul 15, 2015 at 12:38 PM, Sean McNamara sean.mcnam...@webtrends.com
wrote:
I
Hi
I have a requirement to write to an HBase table from a Spark Streaming app
after some processing.
Is the HBase put operation the only way of writing to HBase, or is there any
specialised connector or Spark RDD for HBase writes?
Should bulk load to HBase from a streaming app be avoided if the output of
On 15/07/2015 08:31, Ignacio Blasco wrote:
The main advantage of using scala vs java 8 is being able to use a console
https://bugs.openjdk.java.net/browse/JDK-8043364
--
Alan Burlison
--
Hi Mathieu,
metadata is very useful if you need to save some data about a column
(e.g. count of null values, cardinality, domain, min/max/std, etc.).
It's currently used in ml package in attributes:
Hi,
I want to implement a time-out mechanism in the updateStateByKey(...) routine.
But is there a way to retrieve the start time of the batch
corresponding to the call to my updateStateByKey routine?
Suppose the streaming has built up some delay; then a System.currentTimeMillis()
will
Hi,
I am trying to test partitioning for DataFrames with Parquet, so I
attempted to do df.write().partitionBy("some_column").parquet(path) on a
small dataset of 20,000 records, which when saved as Parquet locally with
gzip takes 4 MB of disk space.
However, on my dev machine with
I have Spark 1.4 on my local machine and I would like to connect to our local
4-node Cloudera cluster. But how?
In the example it says text_file = spark.textFile("hdfs://..."), but can you
advise me on where to get this hdfs://... address?
Thanks!
Elina
Hi all,
I am trying to get some summary statistics to retrieve the moving average
for several devices that have an array of latencies in seconds in this kind
of format:
deviceLatencyMap = [K:String, Iterable[V: Double]]
I understand that there is a MultivariateSummary, but as this is a trait,
but
suppose df <- jsonFile(sqlContext, "json file")
You can extract hashtags.text as a Column object using the following command:
t <- getField(df$hashtags, "text")
and then you can perform operations on the column.
You can extract hashtags.text as a DataFrame using the following command:
t <-
Silly question…
When thinking about a PhD thesis… do you want to tie it to a specific
technology or do you want to investigate an idea but then use a specific
technology.
Or is this an outdated way of thinking?
I am doing my PhD thesis on large-scale machine learning, e.g. online learning,
My choice is java 8
On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote:
On 15/07/2015 08:31, Ignacio Blasco wrote:
The main advantage of using scala vs java 8 is being able to use a console
https://bugs.openjdk.java.net/browse/JDK-8043364
--
Alan Burlison
--
I would suggest studying Spark, Flink, and Storm and, based on your
understanding and findings, preparing your research paper.
Maybe you will invent a new Spark ☺
Regards,
Vaquar khan
On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote:
Silly question…
When thinking about a PhD thesis…
The streaming job has been running ok in 1.2 and 1.3. After I upgraded to
1.4, I started seeing the error below. It appears that it fails in the validate
method in StreamingContext. Has anything changed in 1.4.0 w.r.t.
DStream checkpointing?
Detailed error from driver:
15/07/15 18:00:39 ERROR
I think I found my answer at https://github.com/kayousterhout/trace-analysis:
One thing to keep in mind is that Spark does not currently include
instrumentation to measure the time spent reading input data from disk or
writing job output to disk (the 'Output write wait' shown in the waterfall
is
Would there be any problem in having spark.executor.instances (or
--num-executors) be completely ignored (i.e., even for non-zero values) if
spark.dynamicAllocation.enabled is true (i.e., rather than throwing an
exception)?
I can see how the exception would be helpful if, say, you tried to
Totally agree with Hafsa: you need to identify your requirements and
needs before choosing Spark.
If you want to handle data with fast access, go with NoSQL (Mongo, Aerospike,
etc.); if you need data analytics, then Spark is best.
Regards,
Vaquar khan
On 14 Jul 2015 20:39, Hafsa Asif
When I call the following minimal working example, the accumulator matrix is
32-by-100K, and each executor has 64G but I get an out of memory error:
Exception in thread main java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
Here BDM is a Breeze DenseMatrix
object
Can you show us your function(s) ?
Thanks
On Wed, Jul 15, 2015 at 12:46 PM, Chen Song chen.song...@gmail.com wrote:
The streaming job has been running ok in 1.2 and 1.3. After I upgraded to
1.4, I started seeing error as below. It appears that it fails in validate
method in StreamingContext.
I am trying to write a simple program using the addFile function but I am
getting an error on my worker node that the file does not exist
tage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost
task 0.3 in stage 0.0 (TID 3, slave2.novalocal):
java.io.FileNotFoundException: File
Hi
I am trying to write a simple program using the addFile function but I am
getting an error on my worker node that the file does not exist
tage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost
task 0.3 in stage 0.0 (TID 3, slave2.novalocal):
java.io.FileNotFoundException: File
Based on the list of functions here:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
there doesn't seem to be a way to get the length of an array in a dataframe
without defining a UDF.
What I would be looking for is something like this (except
Well, one of the strengths of Spark is standardized, general distributed
processing allowing many different types of processing, such as graph
processing, stream processing, etc. The limitation is that it is less
performant than a system focusing on only one type of processing (e.g.
graph processing).
bump
From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in
There are three connector packages listed on the Spark Packages web site:
http://spark-packages.org/?q=hbase
HTH.
-Todd
On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Hi
I have a requirement of writing in hbase table from Spark streaming app
after some
Your streaming job may have been seemingly running ok, but the DStream
checkpointing must have been failing in the background. It would have been
visible in the log4j logs. In 1.4.0, we enabled fast failure for that so
that checkpointing failures don't get hidden in the background.
The fact that
The first two examples are from the .mllib API. Really, the new
ALS()...run() form is underneath both of the first two. In the second
case, you're calling a convenience method that calls something similar
to the first example.
The second example is from the new .ml pipelines API. Similar ideas,
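A minimal sketch of the two equivalent .mllib forms being compared, with illustrative rank/iterations/lambda values:
import org.apache.spark.mllib.recommendation.ALS
// Convenience method: ratings, rank, iterations, lambda.
val model1 = ALS.train(ratings, 10, 10, 0.01)
// The builder form that the convenience method wraps.
val model2 = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .run(ratings)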
jshell is nice but it is targeting Java 9
Cheers
On Wed, Jul 15, 2015 at 5:31 AM, Alan Burlison alan.burli...@oracle.com
wrote:
On 15/07/2015 08:31, Ignacio Blasco wrote:
The main advantage of using scala vs java 8 is being able to use a console
In the Spark MLlib example MovieLensALS.scala, ALS run is used; however, in
the movie recommendation with MLlib tutorial, ALS train is used. What is
the difference, and when should you use one versus the other?
val model = new ALS()
.setRank(params.rank)
Ah, cool. Thanks.
On Wed, Jul 15, 2015 at 5:58 PM, Tathagata Das t...@databricks.com wrote:
Spark 1.4.1 just got released! So just download that. Yay for timing.
On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu yuzhih...@gmail.com wrote:
Should be this one:
[SPARK-7180] [SPARK-8090]
On 15/07/2015 21:17, Ted Yu wrote:
jshell is nice but it is targeting Java 9
Yes I know, just pointing out that eventually Java would have a REPL as
well.
--
Alan Burlison
--
Hi all,
Is it possible to use Spark to assign each machine in a cluster the same
task, but on files in each machine's local file system, and then have
the results sent back to the driver program?
Thank you in advance!
Julien
Resubmitting after fixing subscription to mailing list.
Based on the list of functions here:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
there doesn't seem to be a way to get the length of an array in a dataframe
without defining a UDF.
What I
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina elina.jeska...@cgi.com
wrote:
I have Spark 1.4 on my local machine and I would like to connect to our
local 4 nodes Cloudera cluster. But how?
In the example it says text_file = spark.textFile(hdfs://...), but can
you advise me in where to
Thanks
Can you point me to the patch to fix the serialization stack? Maybe I can
pull it in and rerun my job.
Chen
On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t...@databricks.com wrote:
Your streaming job may have been seemingly running ok, but the DStream
checkpointing must have been
Hi All,
I'm happy to announce the Spark 1.4.1 maintenance release.
We recommend all users on the 1.4 branch upgrade to
this release, which contains several important bug fixes.
Download Spark 1.4.1 - http://spark.apache.org/downloads.html
Release notes -
Spark 1.4.1 just got released! So just download that. Yay for timing.
On Wed, Jul 15, 2015 at 2:47 PM, Ted Yu yuzhih...@gmail.com wrote:
Should be this one:
[SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of
SerializationDebugger bugs and limitations
...
Closes #6625 from
Hi Jonathan,
This is a problem that has come up for us as well, because we'd like
dynamic allocation to be turned on by default in some setups, but not break
existing users with these properties. I'm hoping to figure out a way to
reconcile these by Spark 1.5.
-Sandy
On Wed, Jul 15, 2015 at
I could be wrong, but it looks like the only implementation available right now
is MultivariateOnlineSummarizer.
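A minimal sketch of using it directly, assuming the latencies for one device are available as an Iterable[Double]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
// Fold the latencies into a summarizer, then read off mean/variance/min/max.
val summary = deviceLatencies.foldLeft(new MultivariateOnlineSummarizer) {
  (summ, v) => summ.add(Vectors.dense(v))
}
println(summary.mean(0), summary.variance(0))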
Mohammed
From: Nkechi Achara [mailto:nkach...@googlemail.com]
Sent: Wednesday, July 15, 2015 4:31 AM
To: user@spark.apache.org
Subject: Any beginner samples for using ML / MLIB to
Yeah, we could make it log a warning instead.
2015-07-15 14:29 GMT-07:00 Kelly, Jonathan jonat...@amazon.com:
Thanks! Is there an existing JIRA I should watch?
~ Jonathan
From: Sandy Ryza sandy.r...@cloudera.com
Date: Wednesday, July 15, 2015 at 2:27 PM
To: Jonathan Kelly
Should be this one:
[SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of
SerializationDebugger bugs and limitations
...
Closes #6625 from tdas/SPARK-7180 and squashes the following commits:
On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote:
Thanks
Can you point
Thanks for the reply.
Actually, I don't think excluding spark-hive from spark-submit --packages
is a good idea.
I don't want to recompile Spark by assembly for my cluster every time a
new Spark release is out.
I prefer using a binary version of Spark and then adding some jars for job
execution.
Hi Cody,
oh ... I thought that was one of *the* use cases for it. Do you have a
suggestion / best practice for how to achieve the same thing with better scaling
characteristics?
Jan
On 15 Jul 2015, at 15:33, Cody Koeninger c...@koeninger.org wrote:
I personally would try to avoid
I'm trying to compute the Frobenius norm error in approximating an
IndexedRowMatrix A with the product L*R where L and R are Breeze
DenseMatrices.
I've written the following function that computes the squared error over
each partition of rows then sums up to get the total squared error (ignore
I am using the code below with the Kryo serializer. When I run this code I get
this error: Task not serializable (at the commented line). 2) How are broadcast
variables treated in executors? Are they local variables, or can they be used in
any function defined as global variables?
object
Hi Roberto,
I think you would need to as Akhil said. Just checked from this page:
http://aws.amazon.com/public-data-sets/
and clicking through to a few dataset links, all of them are available on
s3 (some are available via http and ftp, but I think the point of these
datasets is that they are
Hi Wush,
One option may be to try a replicated join. Since your rdd1 is small, read
it into a collection and broadcast it to the workers, then filter your
larger rdd2 against the collection on the workers.
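A minimal sketch of that approach, assuming rdd1 and rdd2 are pair RDDs keyed the same way:
// Collect the small RDD's keys and ship them to every executor.
val smallKeys = rdd1.map(_._1).collect().toSet
val smallKeysBc = sc.broadcast(smallKeys)
// Keep only the rows of the large RDD whose key appears in the small set.
val filtered = rdd2.filter { case (k, _) => smallKeysBc.value.contains(k) }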
-sujit
On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
I'm using the following function to compute B*A where B is a 32-by-8mil
Breeze DenseMatrix and A is a 8mil-by-100K IndexedRowMatrix.
// computes BA where B is a local matrix and A is distributed: let b_i
denote the
// ith col of B and a_i denote the ith row of A, then BA = sum(b_i a_i)
def
bq. serializeUncompressed()
Is there a method which enables compression ?
Just wondering if that would reduce the memory footprint.
Cheers
On Wed, Jul 15, 2015 at 8:06 AM, Saeed Shahrivari
saeed.shahriv...@gmail.com wrote:
I use a simple map/reduce step in a Java/Spark program to remove