Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Reynold Xin
In 1.5, we will most likely just rewrite distinct in SQL to either use the Aggregate operator, which will benefit from all the Tungsten optimizations, or have a Tungsten version of distinct for SQL/DataFrame. On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
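
The rewrite described here can be sketched at the SQL level: DISTINCT is equivalent to grouping on every projected column, which is what would let it ride on the Aggregate operator. A minimal PySpark illustration; the table and column names (t, a, b) are hypothetical, and this shows the logical rewrite, not Spark's actual planner code:

    # DISTINCT expressed as a group-by aggregate (hypothetical table t):
    distinct_df = sqlContext.sql("SELECT DISTINCT a, b FROM t")
    rewritten_df = sqlContext.sql("SELECT a, b FROM t GROUP BY a, b")
    # Both return the same rows; the second runs through the Aggregate operator.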

Re: unable to extract tgz files downloaded from spark

2015-05-07 Thread Pramod Biligiri
This happens sometimes when the download gets stopped or corrupted. You can verify the integrity of your file by comparing it with the MD5 and SHA signatures published here: http://www.apache.org/dist/spark/spark-1.3.1/ Pramod On Wed, May 6, 2015 at 7:16 PM, Praveen Kumar Muthuswamy
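
For anyone who wants to script the check, a minimal Python sketch (the filename is an assumption; compare the output against the published .md5 file):

    import hashlib

    path = "spark-1.3.1-bin-hadoop2.6.tgz"  # hypothetical: your downloaded package

    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            md5.update(chunk)

    # Compare with the .md5 published at http://www.apache.org/dist/spark/spark-1.3.1/
    print(md5.hexdigest())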

SparkStreaming Workaround for BlockNotFound Exceptions

2015-05-07 Thread Akhil Das
Hi. With Spark Streaming (all versions), when my processing delay (around 2-4 seconds) exceeds the batch duration (1 second) at a decent scale/throughput (consuming around 100MB/s on a 1+2 node standalone cluster, 15GB and 4 cores each), the job starts to throw block-not-found exceptions when the
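
Two mitigations often suggested for this situation, sketched below under assumptions (a socket source; the host, port, and rate values are illustrative): persist received blocks to disk as well as memory, and throttle the receiver so the backlog cannot outgrow the block manager.

    from pyspark import SparkConf, SparkContext, StorageLevel
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("throttled-stream")
            # Cap each receiver's ingest rate (records/sec); the value is illustrative.
            .set("spark.streaming.receiver.maxRate", "10000"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second batches, as in the post

    # MEMORY_AND_DISK_SER spills received blocks to disk instead of evicting them.
    lines = ssc.socketTextStream("localhost", 9999,
                                 StorageLevel.MEMORY_AND_DISK_SER)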

Automatic testing for Spark App developed in Python

2015-05-07 Thread bbarbieru
I am trying to configure a Spark application to run automatic tests using a Spark local context. The part that doesn't work is when I try importing the functions defined in my main module into the test_main module. I am using __init__.py files to configure the project structure, but I guess the
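
For reference, a minimal shape that usually works for this setup (module and function names are hypothetical; the key points are the __init__.py files on the import path and a local SparkContext created and stopped per test class):

    # test_main.py: assumes src/__init__.py and src/spark/__init__.py exist
    import unittest

    from pyspark import SparkContext
    from src.spark.main import double_values  # hypothetical function under test

    class MainTest(unittest.TestCase):
        def setUp(self):
            self.sc = SparkContext("local[2]", "unit-test")

        def tearDown(self):
            self.sc.stop()

        def test_double_values(self):
            rdd = self.sc.parallelize([1, 2, 3])
            self.assertEqual(double_values(rdd).collect(), [2, 4, 6])

    if __name__ == "__main__":
        unittest.main()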

Spark Streaming with Tachyon : Some findings

2015-05-07 Thread Dibyendu Bhattacharya
Dear All, I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block store. The primary reason for evaluating Tachyon is to find out whether it can solve the Spark BlockNotFoundException. With the traditional MEMORY_ONLY StorageLevel, when blocks are evicted, jobs fail due to block not
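
For context, a minimal sketch of the kind of setup being evaluated (Spark 1.3-era configuration key; the Tachyon master URL is an assumption):

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("tachyon-offheap")
            # Spark 1.3-era key; the URL assumes a Tachyon master on localhost.
            .set("spark.tachyonStore.url", "tachyon://localhost:19998"))
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1000000))
    rdd.persist(StorageLevel.OFF_HEAP)  # blocks live in Tachyon, off the JVM heap
    rdd.count()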

DataFrame distinct vs RDD distinct

2015-05-07 Thread Olivier Girardot
Hi everyone, there seem to be different implementations of the distinct feature in DataFrame and RDD, and a performance issue with the DataFrame distinct API. In RDD.scala: def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { map(x => (x,
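
The RDD implementation quoted above boils down to a map into key-value pairs followed by reduceByKey. The same shape in PySpark, for comparison (a sketch mirroring the Scala code, not the actual library source):

    def rdd_distinct(rdd, num_partitions=None):
        # map each element to (x, None), keep one value per key, drop the None
        return (rdd.map(lambda x: (x, None))
                   .reduceByKey(lambda a, b: a, num_partitions)
                   .map(lambda kv: kv[0]))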

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Olivier Girardot
Ok, but for the moment this seems to be killing performance on some computations... I'll try to give you precise figures on this between RDD and DataFrame. Olivier. On Thu, May 7, 2015 at 10:08 AM, Reynold Xin r...@databricks.com wrote: In 1.5, we will most likely just rewrite distinct in SQL
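
A crude way to produce such figures (a sketch only; df and rdd are assumed to hold the same data, and a real benchmark would need warm caches and multiple runs):

    import time

    def timed(label, action):
        start = time.time()
        action()
        print("%s: %.1fs" % (label, time.time() - start))

    timed("DataFrame distinct", lambda: df.distinct().count())
    timed("RDD distinct", lambda: rdd.distinct().count())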

Re: unable to extract tgz files downloaded from spark

2015-05-07 Thread Praveen Kumar Muthuswamy
Thanks all for the replies. I seem to have downloaded the redirector html as Sean mentioned. It works now. On Thu, May 7, 2015 at 10:36 AM, Frederick R Reiss frre...@us.ibm.com wrote: Hi Praveen, In the past I've downloaded some Spark tarballs that weren't actually gzipped. Try using tar

NoClassDefFoundError with Spark 1.3

2015-05-07 Thread Ganelin, Ilya
Hi all – I’m attempting to build a project with SBT and run it on Spark 1.3 (this worked before we upgraded to CDH 5.4 with Spark 1.3). I have the following in my build.sbt: scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",

Predict.scala using model for clustering In reference

2015-05-07 Thread anshu shukla
Can anyone please explain - println("Initializing the KMeans model...") val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect()) where modelFile is the directory used to persist the model during training. REF-

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Michael Armbrust
I'd happily merge a PR that changes the distinct implementation to be more like Spark core's, assuming it includes benchmarks that show better performance for both the fits-in-memory case and the too-big-for-memory case. On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot

Re: Map one RDD into two RDD

2015-05-07 Thread anshu shukla
One of the best discussions on the mailing list :-) ... Please help me conclude. The whole discussion concludes that: 1. The framework does not support increasing the parallelism of a task through any built-in function. 2. Users have to manually write logic to filter the output of the upstream node
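
Point 2 in practice, as a minimal sketch: cache the upstream RDD once, then derive each downstream RDD with its own filter so the source is not recomputed:

    upstream = sc.parallelize(range(10)).cache()  # cache before branching

    evens = upstream.filter(lambda x: x % 2 == 0)
    odds = upstream.filter(lambda x: x % 2 != 0)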

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to modelFile as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40 PM, anshu shukla
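
A rough PySpark analogue of the pattern Joseph describes (the Scala code persists the center vectors and rebuilds the model from them; here the path is hypothetical and the centers are assumed to have been saved from Python with saveAsPickleFile):

    from pyspark.mllib.clustering import KMeansModel

    centers = sc.pickleFile("hdfs:///models/kmeans-centers").collect()  # hypothetical path
    model = KMeansModel(centers)  # rebuild the model from its persisted centers

    cluster_id = model.predict([0.1, 0.2])  # assign a new point to a cluster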

Re: Automatic testing for Spark App developed in Python

2015-05-07 Thread bbarbieru
It looks like it was a matter of where you call nosetests from; it had to be run from within the src/spark folder since there were some other layers above /src. But I've run into another problem: the Spark context I'm creating runs under the default Python interpreter instead of the one

Hive can not get the schema of an external table created by Spark SQL API createExternalTable

2015-05-07 Thread zhangxiongfei
Hi, I was trying to create an external table named adclicktable via the API def createExternalTable(tableName: String, path: String). I can then get the schema of this table successfully, as below, and the table can be queried normally. The data files are all Parquet files. sqlContext.sql("describe
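
For reference, a sketch of the equivalent call from PySpark (the table name comes from the post, the path is hypothetical, and sqlContext is assumed to be a HiveContext):

    # Register an external table over existing Parquet files, then inspect it.
    sqlContext.createExternalTable("adclicktable", path="hdfs:///data/adclicks")
    sqlContext.sql("describe adclicktable").show()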

Possible long lineage issue when using DStream to update a normal RDD

2015-05-07 Thread Chunnan Yao
Hi all, Recently in our project we needed to update an RDD using data regularly received from a DStream. I plan to use the foreachRDD API to achieve this: var MyRDD = ... dstream.foreachRDD { rdd => MyRDD = MyRDD.join(rdd)... ... } Is this usage correct? My concern is, as I am repeatedly
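
On the lineage concern: every join adds a step to MyRDD's lineage, so it grows without bound, and checkpointing periodically truncates it. A sketch of the pattern in PySpark (the checkpoint directory and interval are assumptions, and union stands in for the join to keep the example self-contained):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical path

    state = {"rdd": None, "batches": 0}

    def update(batch_rdd):
        state["rdd"] = (batch_rdd if state["rdd"] is None
                        else state["rdd"].union(batch_rdd).cache())
        state["batches"] += 1
        if state["batches"] % 10 == 0:  # every 10 batches, cut the lineage
            state["rdd"].checkpoint()
            state["rdd"].count()        # force evaluation so the checkpoint happens

    # dstream is the input DStream from the original question
    dstream.foreachRDD(update)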

[build infra] quick downtime again tomorrow morning for DOCKER

2015-05-07 Thread shane knapp
yes, docker. that wonderful little wrapper for linux containers will be installed and ready for play on all of the jenkins workers tomorrow morning. the downtime will be super quick: i just need to kill the jenkins slaves' ssh connections and relaunch to add the jenkins user to the docker

Re: pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Reynold Xin
What's the use case? I'm wondering if we should even expose fromJSON. I think it's more a bug than a feature. On Thu, May 7, 2015 at 1:55 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Observe, my fellow Sparkophiles (Spark 1.3.1): json_rdd =

Re: pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Nicholas Chammas
Renaming fields to get around SPARK-2775 https://issues.apache.org/jira/browse/SPARK-2775. I’m doing this clunky thing: 1. Convert a DataFrame’s schema to JSON, and then a Python dictionary. 2. Replace the problematic characters in the schema field names. 3. Convert the resulting
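
The three steps look roughly like this (a sketch; the replacement rule is illustrative, and df stands for the DataFrame with the problematic field names):

    import json
    from pyspark.sql.types import StructType

    # 1. DataFrame schema -> JSON -> Python dict
    schema_dict = json.loads(df.schema.json())

    # 2. Replace the problematic characters in the field names
    #    (the dot-to-underscore rule here is illustrative)
    for field in schema_dict["fields"]:
        field["name"] = field["name"].replace(".", "_")

    # 3. Dict -> StructType, then reapply the schema to the underlying RDD
    fixed_schema = StructType.fromJson(schema_dict)
    fixed_df = sqlContext.createDataFrame(df.rdd, fixed_schema)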

pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Nicholas Chammas
Observe, my fellow Sparkophiles (Spark 1.3.1): json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) json_rdd.schema StructType(List(StructField(name,StringType,true))) type(json_rdd.schema) class 'pyspark.sql.types.StructType' json_rdd.schema.json()

Re: [build system] quick jenkins restart thursday morning (5-6-15) 7am PDT

2015-05-07 Thread shane knapp
things are currently rebooting. On Thu, May 7, 2015 at 7:18 AM, shane knapp skn...@berkeley.edu wrote: this is happening now. On Wed, May 6, 2015 at 5:44 PM, shane knapp skn...@berkeley.edu wrote: we've had a spate of issues since the power outage, and now the github pull request builder

Re: [build system] quick jenkins restart thursday morning (5-6-15) 7am PDT

2015-05-07 Thread shane knapp
and we're back up and building. thanks for your patience! On Thu, May 7, 2015 at 7:48 AM, shane knapp skn...@berkeley.edu wrote: things are currently rebooting. On Thu, May 7, 2015 at 7:18 AM, shane knapp skn...@berkeley.edu wrote: this is happening now. On Wed, May 6, 2015 at 5:44 PM,

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
I can try that, but the issue is that, as I understand it, this is supposed to work out of the box (like it does with all the other Spark/Hadoop pre-built packages). On Thu, May 7, 2015 at 12:35 PM Peter Rudenko petro.rude...@gmail.com wrote: Try to download this jar:

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Peter Rudenko
Yep, it's a Hadoop issue: https://issues.apache.org/jira/browse/HADOOP-11863 http://mail-archives.apache.org/mod_mbox/hadoop-user/201504.mbox/%3CCA+XUwYxPxLkfhOxn1jNkoUKEQQMcPWFzvXJ=u+kp28kdejo...@mail.gmail.com%3E http://stackoverflow.com/a/28033408/3271168 So for now you need to manually add that

Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
Details are here: https://issues.apache.org/jira/browse/SPARK-7442 It looks like something specific to building against Hadoop 2.6? Nick

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Reynold Xin
Is this related to s3a update in 2.6? On Thursday, May 7, 2015, Nicholas Chammas nicholas.cham...@gmail.com wrote: Details are here: https://issues.apache.org/jira/browse/SPARK-7442 It looks like something specific to building against Hadoop 2.6? Nick

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
Hmm, I just tried changing s3n to s3a: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found Nick On Thu, May

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Peter Rudenko
Try to download this jar: http://search.maven.org/remotecontent?filepath=org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar And add: export CLASSPATH=$CLASSPATH:hadoop-aws-2.6.0.jar And try to relaunch. Thanks, Peter Rudenko On 2015-05-07 19:30, Nicholas Chammas wrote: Hmm, I just
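
Once the jar (and its aws-java-sdk dependency) is actually on the driver and executor classpaths, access looks roughly like this from PySpark; the bucket, path, and credentials are placeholders, and _jsc is pyspark-internal but the usual way to reach the Hadoop configuration:

    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

    lines = sc.textFile("s3a://my-bucket/path/to/data.txt")  # hypothetical bucket
    print(lines.count())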

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
Ah, thanks for the pointers. So as far as Spark is concerned, is this a breaking change? Is it possible that people who have working code that accesses S3 will upgrade to use Spark-against-Hadoop-2.6 and find their code is not working all of a sudden? Nick On Thu, May 7, 2015 at 12:48 PM Peter

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Matei Zaharia
We should make sure to update our docs to mention s3a as well, since many people won't look at Hadoop's docs for this. Matei On May 7, 2015, at 12:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, thanks for the pointers. So as far as Spark is concerned, is this a breaking

Re: unable to extract tgz files downloaded from spark

2015-05-07 Thread Frederick R Reiss
Hi Praveen, In the past I've downloaded some Spark tarballs that weren't actually gzipped. Try using tar xvf instead of tar xvzf to extract the files. Fred From: Praveen Kumar Muthuswamy muthusamy...@gmail.com To: dev@spark.apache.org Date: 05/06/2015 07:18 PM Subject: unable