Re: IndentationCheck of checkstyle

2015-12-30 Thread Ted Yu
Right. Pardon my carelessness. > On Dec 29, 2015, at 9:58 PM, Reynold Xin <r...@databricks.com> wrote: > > OK to close the loop - this thread has nothing to do with Spark? > > >> On Tue, Dec 29, 2015 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote: >>

Re: [ANNOUNCE] Spark 1.6.0 Release Preview

2015-11-24 Thread Ted Yu
If I am not mistaken, the binaries for Scala 2.11 were generated against hadoop 1. What about binaries for Scala 2.11 against hadoop 2.x ? Cheers On Sun, Nov 22, 2015 at 2:21 PM, Michael Armbrust wrote: > In order to facilitate community testing of Spark 1.6.0, I'm

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Ted Yu
Should a new job be set up under Spark-Master-Maven-with-YARN for hadoop 2.6.x ? Cheers On Thu, Nov 19, 2015 at 5:16 PM, 张志强(旺轩) wrote: > I agreed > +1 > > -- > From: Reynold Xin > Date

Re: Spark 1.6.0 and HDP 2.2 - problem

2016-01-13 Thread Ted Yu
I would suggest trying option #1 first. Thanks > On Jan 13, 2016, at 2:12 AM, Maciej Bryński wrote: > > Hi, > I/m trying to run Spark 1.6.0 on HDP 2.2 > Everything was fine until I tried to turn on dynamic allocation. > According to instruction I need to add shuffle service

Re: Dependency on TestingUtils in a Spark package

2016-01-12 Thread Ted Yu
There is no annotation in TestingUtils class indicating whether it is suitable for consumption by external projects. You should assume the class is not public since its methods may change in future Spark releases. Cheers On Tue, Jan 12, 2016 at 12:36 PM, Robert Dodier

Re: Tungsten in a mixed endian environment

2016-01-12 Thread Ted Yu
I logged SPARK-12778 where endian awareness in Platform.java should help in mixed endian set up. There could be other parts of the code base which are related. Cheers On Tue, Jan 12, 2016 at 7:01 AM, Adam Roberts wrote: > Hi all, I've been experimenting with DataFrame

Re: Kryo registration for Tuples?

2016-06-08 Thread Ted Yu
I think the second group (3 classOf's) should be used. Cheers On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov wrote: > if my RDD is RDD[(String, (Long, MyClass))] > > Do I need to register > > classOf[MyClass] > classOf[(Any, Any)] > > or > > classOf[MyClass] >

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
Please go ahead. On Tue, Jun 7, 2016 at 4:45 PM, franklyn wrote: > Thanks for reproducing it Ted, should i make a Jira Issue?. > > > > -- > View this message in context: >

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
I built with Scala 2.10 >>> df.select(add_one(df.a).alias('incremented')).collect() The above just hung. On Tue, Jun 7, 2016 at 3:31 PM, franklyn wrote: > Thanks Ted !. > > I'm using > >

Re: Dataset API agg question

2016-06-07 Thread Ted Yu
Have you tried the following ? Seq(1->2, 1->5, 3->6).toDS("a", "b") then you can refer to columns by name. FYI On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote: > I'm trying to switch from RDD API to Dataset API > My question is about reduceByKey method > >

Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
With commit 200f01c8fb15680b5630fbd122d44f9b1d096e02 using Scala 2.11: Using Python version 2.7.9 (default, Apr 29 2016 10:48:06) SparkSession available as 'spark'. >>> from pyspark.sql import SparkSession >>> from pyspark.sql.types import IntegerType, StructField, StructType >>> from

Re: Can't compile 2.0-preview with scala 2.10

2016-06-06 Thread Ted Yu
See the following from https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/1642/consoleFull : + SBT_FLAGS+=('-Dscala-2.10') + ./dev/change-scala-version.sh 2.10 FYI On Mon, Jun 6, 2016 at 10:35 AM, Franklyn D'souza <

Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Ted Yu
Congratulations, Yanbo. On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia wrote: > Hi all, > > The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a > super active contributor in many areas of MLlib. Please join me in > welcoming Yanbo! > > Matei >

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Ted Yu
Docker Integration Tests failed on Linux: http://pastebin.com/Ut51aRV3 Here was the command I used: mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package Has anyone seen similar error ? Thanks On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin

Re: Building Spark with Custom Hadoop Version

2016-02-04 Thread Ted Yu
Assuming your change is based on hadoop-2 branch, you can use 'mvn install' command which would put artifacts under 2.8.0-SNAPSHOT subdir in your local maven repo. Here is an example: ~/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.8.0-SNAPSHOT Then you can use the following command to build

Re: Welcoming two new committers

2016-02-08 Thread Ted Yu
Congratulations, Herman and Wenchen. On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia wrote: > Hi all, > > The PMC has recently added two new Spark committers -- Herman van Hovell > and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, > adding new

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
Do you mind pastebin'ning code snippet and exception one more time - I couldn't see them in your original email. Which Spark release are you using ? On Tue, Feb 9, 2016 at 11:55 AM, rakeshchalasani wrote: > Hi All: > > I am getting an "UnsupportedOperationException" when

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
ethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.re

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
> > ++ > |arrayCol| > ++ > | [0, 1]| > | [1, 2]| > | [2, 3]| > | [3, 4]| > | [4, 5]| > | [5, 6]| > | [6, 7]| > | [7, 8]| > | [8, 9]| > | [9, 10]| > ++ > > > > On Tue, Feb 9, 2016 at 4:52 PM

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
Hdfs class is in hadoop-hdfs-XX.jar Can you check the classpath to see if the above jar is there ? Please describe the command lines you used for building hadoop / Spark. Cheers On Thu, Feb 11, 2016 at 5:15 PM, Charlie Wright wrote: > I am having issues trying to run a

Re: Re: Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Ted Yu
gnoreCase("string"); > > String tsColName = null; > if (iTimestamp >= 0) { > tsColName = > jobConf.get(serdeConstants.LIST_COLUMNS).split(",")[iTimestamp]; > } > > > > -- Original Message ------ > *From:* "Jörn Fran

Re: build error: code too big: specialStateTransition(int, IntStream)

2016-01-28 Thread Ted Yu
After this change: [SPARK-12681] [SQL] split IdentifiersParser.g into two files the biggest file under sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser is SparkSqlParser.g Maybe split SparkSqlParser.g up as well ? On Thu, Jan 28, 2016 at 5:21 AM, Iulian Dragoș

Re: Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Ted Yu
For the last two problems, hbase-site.xml seems not to be on classpath. Once hbase-site.xml is put on classpath, you should be able to make progress. Cheers > On Jan 28, 2016, at 1:14 AM, Maciej Bryński wrote: > > Hi, > I'm trying to run SQL query on Hive table which is

Re: Secure multi tenancy on in stand alone mode

2016-02-01 Thread Ted Yu
w.r.t. running Spark on YARN, there are a few outstanding issues. e.g. SPARK-11182 HDFS Delegation Token See also the comments under SPARK-12279 FYI On Mon, Feb 1, 2016 at 1:02 PM, eugene miretsky wrote: > When having multiple users sharing the same Spark cluster,

Re: Encrypting jobs submitted by the client

2016-02-02 Thread Ted Yu
For #1, a brief search landed the following: core/src/main/scala/org/apache/spark/SparkConf.scala: DeprecatedConfig("spark.rpc", "2.0", "Not used any more.") core/src/main/scala/org/apache/spark/SparkConf.scala: "spark.rpc.numRetries" -> Seq(

Re: Scala 2.11 default build

2016-01-30 Thread Ted Yu
Does this mean the following Jenkins builds can be disabled ? https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-MAVEN-SCALA-2.11/ https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.11/ Cheers On Sat, Jan

Re: Spark not able to fetch events from Amazon Kinesis

2016-01-30 Thread Ted Yu
w.r.t. protobuf-java version mismatch, I wonder if you can rebuild Spark with the following change (using maven): http://pastebin.com/fVQAYWHM Cheers On Sat, Jan 30, 2016 at 12:49 AM, Yash Sharma wrote: > Hi All, > I have a quick question if anyone has experienced this

Re: Scala 2.11 default build

2016-02-01 Thread Ted Yu
The following jobs have been established for build against Scala 2.10: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-MAVEN-SCALA-2.10/ https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/ FYI On

Re: Spark log4j fully qualified class name

2016-02-27 Thread Ted Yu
Looking at https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html *WARNING* Generating the caller class information is slow. Thus, use should be avoided unless execution speed is not an issue. On Sat, Feb 27, 2016 at 12:40 PM, Prabhu Joseph
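The warning quoted above concerns log4j 1.2's location-based conversion characters (`%C`, `%F`, `%L`, `%M`), which are computed by walking the call stack for every log event, unlike `%c`, which just prints the logger name. A sketch of the trade-off as a `log4j.properties` fragment (the appender name here is hypothetical, not Spark's shipped config):

```properties
# %c prints the logger *name* - cheap, resolved when the logger is created.
# %C prints the caller's fully qualified *class* - slow, derived from a
# stack walk on every event; avoid it on hot logging paths.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# Fast pattern using the logger name:
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Swapping `%c{1}` for `%C` in the pattern above is what triggers the documented slowdown.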

Re: Re: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Ted Yu
The referenced benchmark is in Chinese. Please provide an English version so that more people can understand. For item 7, looks like the speed of ingest is much slower compared to using Parquet. Cheers On Mon, Feb 22, 2016 at 6:12 AM, 开心延年 wrote: > 1.ya100 is not only the

Re: Hbase in spark

2016-02-26 Thread Ted Yu
In hbase, there is hbase-spark module which supports bulk load. This module is to be backported in the upcoming 1.3.0 release. There is some pending work, such as HBASE-15271 . FYI On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav wrote: > Has anybody implemented bulk load into

Re: Opening a JIRA for QuantileDiscretizer bug

2016-02-22 Thread Ted Yu
When you click on Create, you're brought to 'Create Issue' dialog where you choose Project Spark. Component should be MLlib. Please see also: http://search-hadoop.com/m/q3RTtmsshe1W6cH22/spark+pull+template=pull+request+template On Mon, Feb 22, 2016 at 6:45 PM, Pierson, Oliver C

Re: timeout in shuffle problem

2016-01-24 Thread Ted Yu
Cycling past bits: http://search-hadoop.com/m/q3RTtU5CRU1KKVA42=RE+shuffle+FetchFailedException+in+spark+on+YARN+job On Sun, Jan 24, 2016 at 5:52 AM, wangzhenhua (G) wrote: > Hi, > > I have a problem of time out in shuffle, it happened after shuffle write > and at the

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
com> wrote: > Looks like the other packages may also be corrupt. I’m getting the same > error for the Spark 1.6.1 / Hadoop 2.4 package. > > > https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz > > Nick > ​ > > On Wed, Mar 16, 2016 at 8:28 P

Re: OOM and "spark.buffer.pageSize"

2016-03-28 Thread Ted Yu
I guess you have looked at MemoryManager#pageSizeBytes where the "spark.buffer.pageSize" config can override default page size. FYI On Mon, Mar 28, 2016 at 12:07 PM, Steve Johnston < sjohns...@algebraixdata.com> wrote: > I'm attempting to address an OOM issue. I saw referenced in >

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Ted Yu
@gmail.com> wrote: >>> >> An additional note: The Spark packages being served off of CloudFront >>> (i.e. >>> >> the “direct download” option on spark.apache.org) are also corrupt. >>> >> >>> >> Btw what’s the correct way

Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Ted Yu
Raymond: Did "namenode" appear in any of the Spark config files ? BTW Scala 2.11 is used by the default build. On Tue, Apr 5, 2016 at 6:22 AM, Raymond Honderdors < raymond.honderd...@sizmek.com> wrote: > I can see that the build is successful > > (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-11 Thread Ted Yu
Gentle ping: spark-1.6.1-bin-hadoop2.4.tgz from S3 is still corrupt. On Wed, Apr 6, 2016 at 12:55 PM, Josh Rosen <joshro...@databricks.com> wrote: > Sure, I'll take a look. Planning to do full verification in a bit. > > On Wed, Apr 6, 2016 at 12:54 PM Ted Yu <yuzhih..

Re: spark graphx storage RDD memory leak

2016-04-10 Thread Ted Yu
I see the following code toward the end of the method: // Unpersist the RDDs hidden by newly-materialized RDDs oldMessages.unpersist(blocking = false) prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) Wouldn't the above achieve same effect

Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
The broken build was caused by the following: [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/607/ FYI On Sat, Apr 9, 2016 at 12:01 PM, Jacek Laskowski wrote: > Hi, > > Is this

Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
Sent PR: https://github.com/apache/spark/pull/12276 I was able to get build going past mllib-local module. FYI On Sat, Apr 9, 2016 at 12:40 PM, Ted Yu <yuzhih...@gmail.com> wrote: > The broken build was caused by the following: > > [SPARK-14462][ML][MLLIB] add the mllib-local

Re: [STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Ted Yu
The next line should give some clue: expectCorrectException { ssc.transform(Seq(ds), transformF) } Closure shouldn't include return. On Tue, Apr 5, 2016 at 3:40 PM, Jacek Laskowski wrote: > Hi, > > In >

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Ted Yu
Josh: You may have noticed the following error ( https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console ): [error] javac: invalid source release: 1.8 [error] Usage: javac [error] use -help for a list of possible options On Tue, Apr 5, 2016 at 2:14 PM, Josh

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Looking at recent https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7 builds, there was no such error. I don't see anything wrong with the code: usage = "_FUNC_(str) - " + "Returns str, with the first letter of each word in uppercase, all other letters in " + Mind

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
k Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Apr 5, 2016 at 8:41 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > Looking at recent > &g

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
On Linux, I got: $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz gzip: stdin: unexpected end of file tar: Unexpected EOF in archive tar: Unexpected EOF in archive tar: Error is not recoverable: exiting now On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > >
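The `unexpected end of file` above is the classic symptom of a truncated download rather than a bad build. The failure mode can be reproduced and detected without extracting anything, sketched here in Python (file names are made up for the demo):

```python
import tarfile

# Build a small valid .tgz, then simulate a partial download by truncating it.
with open("demo.txt", "w") as f:
    f.write("hello\n")
with tarfile.open("good.tgz", "w:gz") as t:
    t.add("demo.txt")
with open("good.tgz", "rb") as f:
    data = f.read()
with open("truncated.tgz", "wb") as f:
    f.write(data[:20])  # cut off mid-stream, like an interrupted S3 fetch

def archive_ok(path):
    """Return True if the gzip'd tar can be fully listed, i.e. no early EOF."""
    try:
        with tarfile.open(path, "r:gz") as t:
            t.getmembers()  # forces a read of every header block
        return True
    except (tarfile.ReadError, EOFError, OSError):
        return False

print(archive_ok("good.tgz"))       # True
print(archive_ok("truncated.tgz"))  # False
```

Listing the archive before running `tar zxf` avoids leaving a half-extracted tree behind.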

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
azonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz >> >> Nick >> ​ >> >> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote: >> >>> On Linux, I got: >>> >>> $ tar zxf spark-1.6.1-bin-hadoop2.6

Re: error occurs to compile spark 1.6.1 using scala 2.11.8

2016-03-22 Thread Ted Yu
From the error message, it seems some artifacts from Scala 2.10.4 were left around. FYI maven 3.3.9 is required for master branch. On Tue, Mar 22, 2016 at 3:07 AM, Allen wrote: > Hi, > > I am facing an error when doing compilation from IDEA, please see the > attached. I

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Ted Yu
Do you have performance numbers to back up this proposal for the cogroup operation ? Thanks On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.com> wrote: > Hello devs, > > > > I have found myself in a situation where Spark is doing sub-optimal >

Re: BlockManager WARNINGS and ERRORS

2016-03-27 Thread Ted Yu
The warning was added by: SPARK-12757 Add block-level read/write locks to BlockManager On Sun, Mar 27, 2016 at 12:24 PM, salexln wrote: > HI all, > > I started testing my code (https://github.com/salexln/FinalProject_FCM) > with the latest Spark available in GitHub, > and

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Ted Yu
Josh: SerializerInstance and SerializationStream would also become private[spark], right ? Thanks On Mon, Mar 7, 2016 at 6:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer interface > (org.apache.spark.serializer.Serializer) in your own third-party

Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
Since majority of code is written in Scala which is not analyzed by Coverity, the efficacy of the tool seems limited. > On Mar 4, 2016, at 2:34 AM, Sean Owen wrote: > > https://scan.coverity.com/projects/apache-spark-2f9d080d-401d-47bc-9dd1-7956c411fbb4?tab=overview > >

Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
va code. I'm not suggesting anyone run it regularly, > but one run to catch some bugs is useful. > > I've already triaged ~70 issues there just in the Java code, of which > a handful are important. > > On Fri, Mar 4, 2016 at 12:18 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >

Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
> - bad equals/hashCode > > On Fri, Mar 4, 2016 at 2:52 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > Last time I checked there wasn't high impact defects. > > > > Mind pointing out the defects you think should be fixed ? > > > > Thanks > > >

Re: Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Ted Yu
Please include the stack trace, code snippet, etc. in the JIRA you created so that people can reproduce what you saw. On Sat, Mar 5, 2016 at 7:02 AM, Dhaval Modi wrote: > > Regards, > Dhaval Modi > dhavalmod...@gmail.com > > -- Forwarded message -- > From: Dhaval Modi

explain codegen

2016-04-03 Thread Ted Yu
Hi, Based on master branch refreshed today, I issued 'git clean -fdx' first. Then this command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.0 package -DskipTests I got the following error: scala> sql("explain codegen select 'a' as a group by 1").head

Re: explain codegen

2016-04-04 Thread Ted Yu
; > /* 011 */ ... > > > On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski <ja...@japila.pl> wrote: > >> Hi, >> >> Looks related to the recent commit... >> >> Repository: spark >> Updated Branches: >> refs/heads/master 2262a9335 -> 1f

Re: explain codegen

2016-04-04 Thread Ted Yu
; > Herman van Hövell > > 2016-04-04 12:15 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > >> Could the error I encountered be due to missing import(s) of implicit ? >> >> Thanks >> >> On Sun, Apr 3, 2016 at 9:42 PM, Reynold Xin <r...@databricks.com> w

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: error: reference to sql is ambiguous after import org.apache.spark._ in shell?

2016-04-04 Thread Ted Yu
Looks like the import comes from repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala : processLine("import sqlContext.sql") On Mon, Apr 4, 2016 at 5:16 PM, Jacek Laskowski wrote: > Hi Spark devs, > > I'm unsure if what I'm seeing is correct. I'd

Re: explain codegen

2016-04-04 Thread Ted Yu
wrote: > Why don't you wipe everything out and try again? > > On Monday, April 4, 2016, Ted Yu <yuzhih...@gmail.com> wrote: > >> The commit you mentioned was made Friday. >> I refreshed workspace Sunday - so it was included. >> >> Maybe this was related: >

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Ted Yu
ime worked. Could it be that there is some load balancer/cache in >>> >> front of the archive and some nodes still serve the corrupt packages? >>> >> >>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas >>> >> <nicholas.cham...@gmai

Re: Cache Shuffle Based Operation Before Sort

2016-04-25 Thread Ted Yu
Interesting. bq. details of execution for 10 and 100 scale factor input Looks like some chart (or image) didn't go through. FYI On Mon, Apr 25, 2016 at 12:50 PM, Ali Tootoonchian wrote: > Caching shuffle RDD before the sort process improves system performance. > SQL > planner

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Ted Yu
Interesting analysis. Can you log a JIRA ? > On Apr 21, 2016, at 11:07 AM, atootoonchian wrote: > > SQL query planner can have intelligence to push down filter commands towards > the storage layer. If we optimize the query planner such that the IO to the > storage is reduced

Re: RFC: Remote "HBaseTest" from examples?

2016-04-21 Thread Ted Yu
Zhan: I have mentioned the JIRA numbers in the thread starting with (note the typo in subject of this thread): RFC: Remove ... On Thu, Apr 21, 2016 at 1:28 PM, Zhan Zhang wrote: > FYI: There are several pending patches for DataFrame support on top of > HBase. > >

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
er.ula...@hpe.com > wrote: > Hi Ted, > > > > I have 36 files of size ~600KB and the rest 74 are about 400KB. > > > > Is there a workaround rather than changing Sparks code? > > > > Best regards, Alexander > > > > *From:* Ted Yu [mailto:yu

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
Here is the body of StreamFileInputFormat#setMinPartitions : def setMinPartitions(context: JobContext, minPartitions: Int) { val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
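The snippet above derives a single `maxSplitSize` from the total input length, which is why many small files end up coalesced into a handful of partitions. The arithmetic can be sketched in isolation (Python here for illustration; the names mirror the Scala above and are not Spark API):

```python
import math

def max_split_size(file_lengths, min_partitions):
    """Mirror of the split-size math in StreamFileInputFormat#setMinPartitions."""
    total_len = sum(file_lengths)
    return int(math.ceil(total_len / max(min_partitions, 1.0)))

# 36 files of ~600 KB plus 74 files of ~400 KB, as in the thread:
lengths = [600 * 1024] * 36 + [400 * 1024] * 74
split = max_split_size(lengths, min_partitions=2)
# The split size is half the ~50 MB total, so every small file fits inside
# one split and the combine-format packs many files into each partition.
print(split)  # 26214400
```

Raising `minPartitions` shrinks `maxSplitSize` proportionally, which is the lever for getting more partitions out of many small binary files.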

Re: Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ted Yu
Can you tell us the commit hash using which the test was run ? For #2, if you can give full stack trace, that would be nice. Thanks On Mon, May 23, 2016 at 8:58 AM, Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Hi > > 1) Using latest spark 2.0 I've managed to run

Re: Quick question on spark performance

2016-05-20 Thread Ted Yu
Yash: Can you share the JVM parameters you used ? How many partitions are there in your data set ? Thanks On Fri, May 20, 2016 at 5:59 PM, Reynold Xin wrote: > It's probably due to GC. > > On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote: > >> Hi

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
> >>> Thank you, Steve and Hyukjin. >>> >>> And, don't worry, Ted. >>> >>> Travis launches new VMs for every PR. >>> >>> Apache Spark repository uses the following setting. >>> >>> VM: Google Compute Engine >&g

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
`dev/lint-java`. > - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`. > > Thank you, Ted. > > Dongjoon. > > On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> The following line was repeated twice: >> >> - For Oracle JDK7,

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
The following line was repeated twice: - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`. Did you intend to cover JDK 8 ? Cheers On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun wrote: > Hi, All. > > I want to propose the followings. > > - Turn on Travis

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
mend your commit title or messages, see the Travis CI. > Or, you can monitor Travis CI result on status menu bar. > If it shows green icon, you have nothing to do. > >https://docs.travis-ci.com/user/apps/ > > To sum up, I think we don't need to wait for any CIs.

Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Ted Yu
Please log a JIRA. Thanks On Tue, May 24, 2016 at 8:33 AM, Koert Kuipers wrote: > hello, > as we continue to test spark 2.0 SNAPSHOT in-house we ran into the > following trying to port an existing application from spark 1.6.1 to spark > 2.0.0-SNAPSHOT. > > given this code: >

Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming > On May 11, 2016, at 1:47 AM, Ofir Manor wrote: > > Hi, > I'm trying out Structured Streaming from current 2.0

Re: Cache Shuffle Based Operation Before Sort

2016-05-08 Thread Ted Yu
I assume there were supposed to be images following this line (which I don't see in the email thread): bq. Let’s look at details of execution for 10 and 100 scale factor input Consider using 3rd party image site. On Sun, May 8, 2016 at 5:17 PM, Ali Tootoonchian wrote: > Thanks

Re: dataframe udf function will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Ted Yu
In master branch, behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem about spark DataFrame. My spark version is 1.6.1. > Basically, i used udf and df.withColumn to create a

Re: Query parsing error for the join query between different database

2016-05-18 Thread Ted Yu
Which release of Spark / Hive are you using ? Cheers > On May 18, 2016, at 6:12 AM, JaeSung Jun wrote: > > Hi, > > I'm working on custom data source provider, and i'm using fully qualified > table name in FROM clause like following : > > SELECT user. uid, dept.name > FROM

Re: Proposal of closing some PRs and maybe some PRs abandoned by its author

2016-05-06 Thread Ted Yu
PR #10572 was listed twice. In the future, is it possible to include the contributor's handle beside the PR number so that people can easily recognize their own PR ? Thanks On Fri, May 6, 2016 at 8:45 AM, Hyukjin Kwon wrote: > Hi all, > > > This was similar with the

Re: spark 2 segfault

2016-05-02 Thread Ted Yu
lt;ko...@tresata.com> wrote: > >> Created issue: >> https://issues.apache.org/jira/browse/SPARK-15062 >> >> On Mon, May 2, 2016 at 6:48 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> I tried the same statement using Spark 1.6.1 >>> There was no e

Re: spark 2 segfault

2016-05-02 Thread Ted Yu
gt; showed earlier. > >> On May 2, 2016 12:09 AM, "Ted Yu" <yuzhih...@gmail.com> wrote: >> Using commit hash 90787de864b58a1079c23e6581381ca8ffe7685f and Java 1.7.0_67 >> , I got: >> >> scala> val dfComplicated = sc.parallelize(List((Map(&quo

Re: SQLContext and "stable identifier required"

2016-05-03 Thread Ted Yu
Have you tried the following ? scala> import spark.implicits._ import spark.implicits._ scala> spark res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@323d1fa2 Cheers On Tue, May 3, 2016 at 9:16 AM, Koert Kuipers wrote: > with the introduction of

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
ks to prints I've added > > Cheers, > > > > > From:Ted Yu <yuzhih...@gmail.com> > To:Adam Roberts/UK/IBM@IBMGB > Cc:"dev@spark.apache.org" <dev@spark.apache.org> > Date:15/04/20

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
turning: " + unaligned);* > } > } > > Output is, as you'd expect, "used reflection and _unaligned is false, > setting to true anyway for experimenting", and the tests pass. > > No other problems on the platform (pending a different pull request). > &g

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I assume you tested 2.0 with SPARK-12181 . Related code from Platform.java if java.nio.Bits#unaligned() throws exception: // We at least know x86 and x64 support unaligned access. String arch = System.getProperty("os.arch", ""); //noinspection

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
sometimes mistake happen but the cost to reopen is approximately zero (i.e. > click a button on the pull request). > > > On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> bq. close the ones where they don't respond for a week >> >>

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
about if we filtered for non-mergeable PRs or instead left a comment > asking the author to respond if they are still available to move the PR > forward - and close the ones where they don't respond for a week? > > Just a suggestion. > On Monday, April 18, 2016, Ted Yu <yuzhih...@gm

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
you were referring to the cost of closing the pull request, >>> and you >>> > are assuming people look at the pull requests that have been inactive >>> for a >>> > long time. That seems equally likely (or unlikely) as committers >>> looking at

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
ep it within Spark? > > On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> bq. HBase's current support, even if there are bugs or things that still >> need to be done, is much better than the Spark example >> >> In my opinion, a simpl

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
of HBase's input > formats), which makes it not very useful as a blueprint for developing > HBase apps with Spark. > > On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > bq. I wouldn't call it "incomplete". > > > > I would call i

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
; On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> bq. it's actually in use right now in spite of not being in any upstream >> HBase release >> >> If it is not in upstream, then it is not relevant for discussion on >> Apache mai

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
van...@cloudera.com> wrote: > On Tue, Apr 19, 2016 at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> The same question can be asked w.r.t. examples for other projects, such >> as flume and kafka. >> > > The main difference being that flume and kafka int

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
r 19, 2016 at 10:23 AM, Marcelo Vanzin <van...@cloudera.com> wrote: > On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > I want to note that the hbase-spark module in HBase is incomplete. Zhan > has > > several patches pending review. &

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. create a separate tarball for them Probably another thread can be started for the above. I am fine with it. On Tue, Apr 19, 2016 at 10:34 AM, Marcelo Vanzin wrote: > On Tue, Apr 19, 2016 at 10:28 AM, Reynold Xin wrote: > > Yea in general I feel

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
gs too > many dependencies for something that is not really useful, is why I'm > suggesting removing it. > > > On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > There is an Open JIRA for fixing the documentation: HBASE-15473 > > > > I wo

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Ted Yu
; unaligned memory access on a platform where unaligned memory access is > definitely not supported for shorts/ints/longs. > > if these tests continue to pass then I think the Spark tests don't > exercise unaligned memory access, cheers > > > > > > > > From:Ted

Re: Improving system design logging in spark

2016-04-20 Thread Ted Yu
Interesting. For #3: bq. reading data from, I guess you meant reading from disk. On Wed, Apr 20, 2016 at 10:45 AM, atootoonchian wrote: > Current spark logging mechanism can be improved by adding the following > parameters. It will help in understanding system bottlenecks and

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
I had one PR which got merged after 3 months. If the inactivity was due to contributor, I think it can be closed after 30 days. But if the inactivity was due to lack of review, the PR should be kept open. On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger wrote: > For what

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
ess; more committers; a higher barrier to contributing; a combination >> thereof; etc... >> >> Also relevant: http://danluu.com/discourage-oss/ >> >> By the way, some people noted that closing PRs may discourage >> contributors. I think our open PR count alon

Re: Build speed

2016-07-22 Thread Ted Yu
I assume you have enabled Zinc. Cheers On Fri, Jul 22, 2016 at 7:54 AM, Mikael Ståldal wrote: > Is there any way to speed up an incremental build of Spark? > > For me it takes 8 minutes to build the project with just a few code > changes. > > -- > [image: MagineTV] >
