In 1.5, we will most likely just rewrite distinct in SQL to either use the
Aggregate operator which will benefit from all the Tungsten optimizations,
or have a Tungsten version of distinct for SQL/DataFrame.
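In the meantime, the same result can already be pushed through the Aggregate
operator by hand, since a distinct over all columns is just a group-by over all
columns. A rough Scala sketch of that workaround (the df DataFrame and its
source table are placeholders, and this is not the planned 1.5 implementation):

import org.apache.spark.sql.functions.col

// Placeholder DataFrame; any schema works since we group by every column.
val df = sqlContext.table("events")

// distinct expressed as an aggregation: grouping by all columns yields one
// row per unique combination, and the count column is projected away again.
val distinctViaAgg = df
  .groupBy(df.columns.map(col): _*)
  .count()
  .select(df.columns.map(col): _*)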
On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
This happens sometimes when the download gets stopped or corrupted. You can
verify the integrity of your file by comparing it against the md5 and sha
signatures published here: http://www.apache.org/dist/spark/spark-1.3.1/
Pramod
On Wed, May 6, 2015 at 7:16 PM, Praveen Kumar Muthuswamy
Hi,
With Spark Streaming (all versions), when my processing delay (around 2-4
seconds) exceeds the batch duration (1 second) at a decent
scale/throughput (consuming around 100MB/s on a 1+2-node standalone cluster, 15GB and 4
cores each), the job will start to throw block-not-found exceptions when the
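One mitigation worth trying while diagnosing this is to throttle the receivers
and let blocks spill to disk, so delayed batches do not find their input
already evicted. A hedged sketch, assuming a receiver-based socket stream (the
rate limit, host, and batch interval are illustrative only):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("throttled-stream")
  // Cap each receiver's ingest rate (records/sec) so slow batches cannot
  // pile up more blocks than the executors can hold; value is illustrative.
  .set("spark.streaming.receiver.maxRate", "10000")

// Batch interval sized above the observed processing delay.
val ssc = new StreamingContext(conf, Seconds(4))

// A serialized, disk-spillable storage level keeps unprocessed blocks from
// being dropped outright under memory pressure.
val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)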
I am trying to configure a Spark application to run automated tests using a
Spark local context. The part that doesn't work is when I try importing the
functions defined in my main module into the test_main module. I am using
__init__.py files to configure the project structure, but I guess the
Dear All,
I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block
store. The primary reason for evaluating Tachyon is to find out whether Tachyon can
solve the Spark BlockNotFoundException.
With the traditional MEMORY_ONLY StorageLevel, when blocks are evicted, jobs
fail due to block not
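For reference, the experiment boils down to pointing Spark at a running
Tachyon master and persisting the stream's blocks off-heap; a minimal sketch,
assuming a local Tachyon deployment (the URL, directory, and source are
placeholders, and the config keys are the pre-1.5 Tachyon-specific ones):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-on-tachyon")
  // Pre-1.5 Tachyon block-store settings; URL and directory are placeholders.
  .set("spark.tachyonStore.url", "tachyon://localhost:19998")
  .set("spark.tachyonStore.baseDir", "/tmp/spark_tachyon")

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("host", 9999)

// OFF_HEAP blocks live in Tachyon rather than the executor heap, so heap
// pressure no longer evicts them out from under a delayed batch.
lines.persist(StorageLevel.OFF_HEAP)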
Hi everyone,
there seem to be different implementations of the distinct feature in
DataFrames and RDDs, and some performance issues with the DataFrame distinct
API.
In RDD.scala :
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
withScope { map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1) }
Ok, but for the moment, this seems to be killing performance on some
computations...
I'll try to give you precise figures on this between RDD and DataFrame.
Olivier.
Le jeu. 7 mai 2015 à 10:08, Reynold Xin r...@databricks.com a écrit :
In 1.5, we will most likely just rewrite distinct in SQL
Thanks all for the replies. I seem to have downloaded the redirector html
as Sean mentioned. It works now.
On Thu, May 7, 2015 at 10:36 AM, Frederick R Reiss frre...@us.ibm.com
wrote:
Hi Praveen,
In the past I've downloaded some Spark tarballs that weren't actually
gzipped. Try using tar
Hi all – I’m attempting to build a project with SBT and run it on Spark 1.3
(this previously worked before we upgraded to CDH 5.4 with Spark 1.3).
I have the following in my build.sbt:
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
Can anyone please explain -
println("Initializing the KMeans model...")
val model = new
KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())
where modelFile is the *directory used to persist the model during training*
REF-
I'd happily merge a PR that changes the distinct implementation to be more
like Spark core, assuming it includes benchmarks that show better
performance for both the fits-in-memory case and the too-big-for-memory
case.
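For anyone putting such a PR together, the comparison can start as simply as
timing both code paths on the same synthetic data in the spark-shell; a crude
sketch, not a rigorous harness (data size and shape are arbitrary):

import sqlContext.implicits._  // spark-shell sc / sqlContext assumed

// Crude timing helper; a real benchmark would warm up and average several runs.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

// Synthetic data with heavy duplication so distinct has real work to do.
val rdd = sc.parallelize(1 to 10000000).map(_ % 1000)
val df  = rdd.map(Tuple1(_)).toDF("value")

time("RDD distinct")(rdd.distinct().count())
time("DataFrame distinct")(df.distinct().count())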
On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot
One of the best discussions on the mailing list :-) ... Please help me in
concluding --
The whole discussion concludes that:
1- The framework does not support increasing the parallelism of any task just by
any inbuilt function.
2- Users have to manually write logic to filter the output of an upstream node
A KMeansModel was trained in the previous step, and it was saved to
modelFile as a Java object file. This step is loading the model back and
reconstructing the KMeansModel, which can then be used to classify new
tweets into different clusters.
Joseph
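In code terms, training writes the cluster centers out with saveAsObjectFile,
and this step reads them back and wraps them in a new KMeansModel; a hedged
sketch of both halves (the paths, training data, and tweet vector are
placeholders):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector

// Training side: persist the learned cluster centers as a Java object file.
// trainingVectors (an RDD[Vector]) and modelFile are placeholders.
val trained = KMeans.train(trainingVectors, 10, 20)
sc.parallelize(trained.clusterCenters).saveAsObjectFile(modelFile.toString)

// Scoring side: rebuild the model from the saved centers and classify new points.
// newTweetVector is a placeholder for the featurized tweet.
val model = new KMeansModel(sc.objectFile[Vector](modelFile.toString).collect())
val clusterId = model.predict(newTweetVector) // index of the nearest cluster center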
On Thu, May 7, 2015 at 12:40 PM, anshu shukla
It looks like it was a matter of where you call nosetests from; it had to be
run from within the src/spark folder since there were some other layers
above /src. But I've run into another problem: the Spark context I'm
creating is run under the default Python interpreter instead of the one
Hi,
I was trying to create an external table named adclicktable via the API def
createExternalTable(tableName: String, path: String). I can then get the schema
of this table successfully, as shown below, and this table can be queried
normally. The data files are all Parquet files.
sqlContext.sql("describe
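For reference, the flow described above boils down to registering the Parquet
directory as an external table and then querying it; a minimal sketch,
assuming an HDFS directory of Parquet files (the path here is a placeholder):

// Register an external table backed by existing Parquet files, then inspect it.
// The path is a placeholder for wherever the Parquet data actually lives.
val adclicks = sqlContext.createExternalTable("adclicktable", "hdfs:///data/adclicks")

sqlContext.sql("describe adclicktable").show()
sqlContext.sql("select count(*) from adclicktable").show()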
Hi all,
Recently in our project, we need to update an RDD using data regularly
received from a DStream. I plan to use the foreachRDD API to achieve this:
var MyRDD = ...
dstream.foreachRDD { rdd =>
MyRDD = MyRDD.join(rdd)...
...
}
Is this usage correct? My concern is, as I am repeatedly
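The usual worry with this pattern is that the lineage of the state RDD grows
with every batch; a common mitigation is to cache the joined result and
periodically checkpoint it. A hedged sketch of that variant, assuming dstream
is a DStream of (String, Long) pairs (the merge logic, types, and checkpoint
path are placeholders):

import org.apache.spark.rdd.RDD

// State RDD keyed the same way as the incoming batches; types are placeholders.
var myRDD: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]

sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints") // placeholder path
var batchCount = 0

dstream.foreachRDD { rdd =>
  // Each batch extends the lineage by one join; cache the result so the next
  // batch reuses the materialized state instead of recomputing the whole chain.
  myRDD = myRDD.join(rdd).mapValues { case (old, fresh) => old + fresh }
  myRDD.cache()

  batchCount += 1
  if (batchCount % 10 == 0) {
    myRDD.checkpoint() // truncate the ever-growing lineage
    myRDD.count()      // force the checkpoint to materialize
  }
}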
yes, docker. that wonderful little wrapper for linux containers will be
installed and ready for play on all of the jenkins workers tomorrow morning.
the downtime will be super quick: i just need to kill the jenkins slaves'
ssh connections and relaunch to add the jenkins user to the docker
What's the use case?
I'm wondering if we should even expose fromJSON. I think it's more a bug
than a feature.
On Thu, May 7, 2015 at 1:55 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Observe, my fellow Sparkophiles (Spark 1.3.1):
json_rdd =
Renaming fields to get around SPARK-2775
https://issues.apache.org/jira/browse/SPARK-2775.
I’m doing this clunky thing (see the sketch after these steps for an alternative):
1. Convert a DataFrame’s schema to JSON, and then a Python dictionary.
2. Replace the problematic characters in the schema field names.
3.
Convert the resulting
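For comparison, the same renaming can be done without round-tripping through
JSON by rewriting the StructType directly and re-applying it; a hedged Scala
sketch of that alternative (df is the problematic DataFrame, and the character
substitution is illustrative):

import org.apache.spark.sql.types.StructType

// Copy the schema, replacing the characters Parquet rejects in field names.
val cleaned = StructType(df.schema.fields.map { field =>
  field.copy(name = field.name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
})

// Re-apply the cleaned schema to the same rows; types and column order are unchanged.
val renamed = sqlContext.createDataFrame(df.rdd, cleaned)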
Observe, my fellow Sparkophiles (Spark 1.3.1):
json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}']))
json_rdd.schema
StructType(List(StructField(name,StringType,true)))
type(json_rdd.schema)
<class 'pyspark.sql.types.StructType'>
json_rdd.schema.json()
things are currently rebooting.
On Thu, May 7, 2015 at 7:18 AM, shane knapp skn...@berkeley.edu wrote:
this is happening now.
On Wed, May 6, 2015 at 5:44 PM, shane knapp skn...@berkeley.edu wrote:
we've had a spate of issues since the power outage, and now the github
pull request builder
and we're back up and building. thanks for your patience!
On Thu, May 7, 2015 at 7:48 AM, shane knapp skn...@berkeley.edu wrote:
things are currently rebooting.
On Thu, May 7, 2015 at 7:18 AM, shane knapp skn...@berkeley.edu wrote:
this is happening now.
On Wed, May 6, 2015 at 5:44 PM,
I can try that, but the issue is I understand this is supposed to work out
of the box (like it does with all the other Spark/Hadoop pre-built
packages).
On Thu, May 7, 2015 at 12:35 PM Peter Rudenko petro.rude...@gmail.com
wrote:
Try to download this jar:
Yep it's a Hadoop issue: https://issues.apache.org/jira/browse/HADOOP-11863
http://mail-archives.apache.org/mod_mbox/hadoop-user/201504.mbox/%3CCA+XUwYxPxLkfhOxn1jNkoUKEQQMcPWFzvXJ=u+kp28kdejo...@mail.gmail.com%3E
http://stackoverflow.com/a/28033408/3271168
So for now you need to manually add that
Details are here: https://issues.apache.org/jira/browse/SPARK-7442
It looks like something specific to building against Hadoop 2.6?
Nick
Is this related to s3a update in 2.6?
On Thursday, May 7, 2015, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Details are here: https://issues.apache.org/jira/browse/SPARK-7442
It looks like something specific to building against Hadoop 2.6?
Nick
Hmm, I just tried changing s3n to s3a:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
Nick
On Thu, May
Try to download this jar:
http://search.maven.org/remotecontent?filepath=org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar
And add:
export CLASSPATH=$CLASSPATH:hadoop-aws-2.6.0.jar
And try to relaunch.
Thanks,
Peter Rudenko
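If the jar is on the classpath but the scheme still isn't picked up, the
filesystem can also be wired up explicitly through the Hadoop configuration; a
hedged sketch using the stock Hadoop 2.6 s3a keys (the credentials and bucket
path are placeholders):

// Point the s3a:// scheme at the S3AFileSystem from hadoop-aws 2.6.0 and
// supply credentials; the key values and bucket path are placeholders.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val lines = sc.textFile("s3a://some-bucket/some/path/*")
println(lines.count())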
On 2015-05-07 19:30, Nicholas Chammas wrote:
Hmm, I just
Ah, thanks for the pointers.
So as far as Spark is concerned, is this a breaking change? Is it possible
that people who have working code that accesses S3 will upgrade to use
Spark-against-Hadoop-2.6 and find their code is not working all of a sudden?
Nick
On Thu, May 7, 2015 at 12:48 PM Peter
We should make sure to update our docs to mention s3a as well, since many
people won't look at Hadoop's docs for this.
Matei
On May 7, 2015, at 12:57 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Ah, thanks for the pointers.
So as far as Spark is concerned, is this a breaking
Hi Praveen,
In the past I've downloaded some Spark tarballs that weren't actually
gzipped. Try using tar xvf instead of tar xvzf to extract the files.
Fred
From: Praveen Kumar Muthuswamy muthusamy...@gmail.com
To: dev@spark.apache.org
Date: 05/06/2015 07:18 PM
Subject: unable