RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to implement it in a generic way. BTW, I created the JIRA by copying most of the wording from Alex. ☺ https://issues.apache.org/jira/browse/SPARK-11512 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, November 5, 2015

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? E.g. df1.join(broadcast(df2)). The broadcast function is in org.apache.spark.sql.functions. On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has
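For readers unfamiliar with the hint: in a broadcast join, the smaller side is shipped whole to every task and joined through an in-memory hash lookup, so the larger side does not need to be shuffled. The following is a plain-Python sketch of that idea only. It does not use Spark, and the names broadcast_hash_join, large_partitions and small_rows are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against the small side.

    Illustrative only: the names and data layout are assumptions, not Spark's API.
    """
    # Build the lookup table once from the small side. In a real engine this
    # table is what would be broadcast to every worker.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    # Each partition is processed independently; the large side is never shuffled.
    for partition in large_partitions:
        for left in partition:
            for right in lookup.get(left[key], []):
                merged = dict(right)
                merged.update(left)
                joined.append(merged)
    return joined


# Two partitions of a "large" table and one small lookup table (made-up data).
orders = [[{"id": 1, "amount": 10}], [{"id": 2, "amount": 5}, {"id": 3, "amount": 7}]]
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = broadcast_hash_join(orders, users, "id")
```

The row with id 3 has no match in the small table, so an inner join drops it. Whether Spark actually picks a broadcast join depends on its planner and on the hint being honored, which this sketch does not model.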

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Egor Pahomov
+1. Things our infrastructure uses that I checked: dynamic allocation, Spark ODBC server, reading JSON, writing Parquet, SQL queries (Hive context), running on CDH. 2015-11-04 9:03 GMT-08:00 Sean Owen : > As usual the signatures and licenses and so on look fine. I continue >

Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
I'm not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that? -- Best Regards Jeff Zhang

RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 minute vs. 2 hours seems quite weird; can you provide more information on the ETL work? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Thursday, November 5, 2015 12:56 PM To: gen tang; dev@spark.apache.org Subject: RE: dataframe slow down with tungsten turn on 1.5 has critical

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably two reasons: 1. HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was created based on 1.3. 2. HadoopFsRelation introduces the concept of Partition, which is probably not necessary for LibSVMRelation. But I think it will be easy to change as extending from

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Thanks, Hao. I have already made it extend HadoopFsRelation and it works. I will create a JIRA for that. Besides that, I noticed that in DataSourceStrategy, Spark builds the physical plan by pattern matching on the trait of the BaseRelation (e.g. CatalystScan, TableScan, HadoopFsRelation). That

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
I think you’re right: we do give developers the opportunity to make mistakes while implementing a new data source. Here we assume that the new relation MUST NOT extend more than one of the traits CatalystScan, TableScan, PrunedScan, PrunedFilteredScan, etc.; otherwise it will cause
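The concern about a relation extending more than one scan trait can be illustrated with a small sketch. This is plain Python, not Spark code: the class names mirror traits mentioned in the thread, but the dispatch function plan_scan and the order of its checks are assumptions, chosen only to show how first-match dispatch silently ignores the second trait. The message above is cut off before it states the actual consequence in Spark, so nothing here should be read as describing it.

```python
class TableScan:
    """Stand-in for a full-scan trait (hypothetical, for illustration)."""


class PrunedScan:
    """Stand-in for a column-pruning scan trait (hypothetical)."""


def plan_scan(relation):
    # Checks run top to bottom and the first match wins, like the cases of a
    # pattern match. The order used here is an assumption for illustration.
    if isinstance(relation, PrunedScan):
        return "pruned scan"
    if isinstance(relation, TableScan):
        return "table scan"
    raise TypeError("unsupported relation")


class AmbiguousRelation(TableScan, PrunedScan):
    """Extends two scan traits; only one branch can ever be chosen."""


# The TableScan branch is never reached for this relation.
choice = plan_scan(AmbiguousRelation())
```

With this ordering the relation is always planned through the PrunedScan branch, and reordering the checks would flip the outcome without any change to the relation itself. That kind of silent dependence on match order is what makes mixing traits easy to get wrong.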

pyspark with pypy not work for spark-1.5.1

2015-11-04 Thread Chang Ya-Hsuan
Hi all, I am trying to run pyspark with pypy. It works with spark-1.3.1 but fails with spark-1.4.1 and spark-1.5.1. My pypy version: $ /usr/bin/pypy --version Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40) [PyPy 2.2.1 with GCC 4.8.4] works with spark-1.3.1 $

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Thanks, Owen. Will do. On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen wrote: > I'm pretty sure that attribute is required. I am not sure what PMML > version the code has been written for but would assume 4.2.1. Feel > free to open a PR to add this version to all the output. > >

Re: Codegen In Shuffle

2015-11-04 Thread 牛兆捷
I see. Thanks very much. 2015-11-04 16:25 GMT+08:00 Reynold Xin : > GenerateUnsafeProjection -- projects any internal row data structure > directly into bytes (UnsafeRow). > > > On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > >> Dear all: >> >> Tungsten

Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML version the code has been written for but would assume 4.2.1. Feel free to open a PR to add this version to all the output. On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote: > [adding dev] > > On Wed, Nov

Re: Master build fails ?

2015-11-04 Thread Jacek Laskowski
Hi, It appears it's time to switch to my lovely sbt then! Regards, Jacek -- Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl Follow me at https://twitter.com/jaceklaskowski Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski On Tue, Nov 3, 2015 at 2:58

Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
We’ve been making use of both. Fine-grained mode makes sense for more ad-hoc workloads, and coarse-grained mode for more job-like loads on a common data set. My preference is fine-grained mode in all cases, but the overhead associated with its startup and the possibility that an overloaded cluster

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Jean-Baptiste Onofré
+1 (non-binding). Just tested with some snippets on my side. Regards, JB On 11/04/2015 12:22 AM, Reynold Xin wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data

Codegen In Shuffle

2015-11-04 Thread 牛兆捷
Dear all: The Tungsten project has mentioned that code generation is being applied to speed up the conversion of data from the in-memory binary format to the wire protocol for shuffle. Where can I find the related implementation in the Spark code base? -- *Regards,* *Zhaojie*