pyspark with pypy does not work for spark-1.5.1

2015-11-04 Thread Chang Ya-Hsuan
Hi all, I am trying to run pyspark with pypy; it works with spark-1.3.1 but fails with spark-1.4.1 and spark-1.5.1. My pypy version: $ /usr/bin/pypy --version Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40) [PyPy 2.2.1 with GCC 4.8.4] works with spark-1.3.1 $ PYSP

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
I think you’re right, we do offer the opportunity for developers to make mistakes while implementing a new Data Source. Here we assume that the new relation MUST NOT extend more than one trait of CatalystScan, TableScan, PrunedScan, PrunedFilteredScan, etc., otherwise it will cause prob
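A minimal sketch of a well-formed relation against the Spark 1.5 data source API, mixing in exactly one scan trait (TableScan); the class name and column are hypothetical:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Mixes in exactly ONE scan trait, as Hao describes; mixing in several
    // (e.g. TableScan together with PrunedScan) is what confuses the planner.
    class IntsRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
      override def schema: StructType = StructType(StructField("value", IntegerType) :: Nil)
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
    }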

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Thanks Hao. I have already made it extend HadoopFsRelation and it works. Will create a jira for that. Besides that, I noticed that in DataSourceStrategy, Spark builds the physical plan by pattern matching on the trait of the BaseRelation (e.g. CatalystScan, TableScan, HadoopFsRelation). That means
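A toy illustration of the trait-based dispatch described here (the real logic lives in DataSourceStrategy; the trait names are the 1.5 ones, while the function and its strings are made up):

    import org.apache.spark.sql.sources._

    // Sketch only: the planner effectively asks which scan trait the
    // BaseRelation mixes in and builds the physical scan accordingly.
    def scanKind(relation: BaseRelation): String = relation match {
      case _: CatalystScan       => "raw Catalyst expressions pushed down"
      case _: PrunedFilteredScan => "column pruning plus filter push-down"
      case _: PrunedScan         => "column pruning only"
      case _: TableScan          => "full table scan"
      case _                     => "no scan trait mixed in"
    }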

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably 2 reasons: 1. HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was created against 1.3. 2. HadoopFsRelation introduces the concept of Partition, which is probably not necessary for LibSVMRelation. But I think it will be easy to change to extend from HadoopFsR

RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 min vs. 2 hours seems quite weird; can you provide more information on the ETL work? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Thursday, November 5, 2015 12:56 PM To: gen tang; dev@spark.apache.org Subject: RE: dataframe slow down with tungsten turn on 1.5 has critical perform

RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
1.5 has critical performance/bug issues; you’d better try 1.5.1 or a 1.5.2 RC version. From: gen tang [mailto:gen.tan...@gmail.com] Sent: Thursday, November 5, 2015 12:43 PM To: dev@spark.apache.org Subject: Fwd: dataframe slow down with tungsten turn on Hi, In fact, I tested the same code with

Fwd: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Hi, In fact, I tested the same code on spark 1.5 with Tungsten turned off. The result is quite the same as with Tungsten turned on. It seems that it is not a Tungsten problem; it is simply that spark 1.5 is slower than spark 1.4. Is there any idea about why this happens? Thanks a lot in advance
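For reference, the switch presumably used for this A/B test is a single SQL conf flag; a sketch, assuming a 1.5 spark-shell session where sqlContext is predefined:

    // spark.sql.tungsten.enabled defaults to true in 1.5; setting it to false
    // compares against the pre-Tungsten code paths.
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")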

Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that? -- Best Regards Jeff Zhang

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Egor Pahomov
+1 Things our infrastructure uses and that I checked: dynamic allocation, Spark ODBC server, reading JSON, writing Parquet, SQL queries (hive context), running on CDH. 2015-11-04 9:03 GMT-08:00 Sean Owen : > As usual the signatures and licenses and so on look fine. I continue > to get the same test

RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to implement it in a generic way. BTW, I created the JIRA by copying most of the words from Alex. ☺ https://issues.apache.org/jira/browse/SPARK-11512 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, November 5, 2015 1:36

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Charmee Patel
For other reasons we are using Spark SQL, not the DataFrame API. I saw that the broadcast hint is only available in the DataFrame API. On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin wrote: > Can you use the broadcast hint? > > e.g. > > df1.join(broadcast(df2)) > > the broadcast function is in org.apache.spark
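For a SQL-only pipeline, the closest knob is the size threshold below which the optimizer broadcasts a table on its own; a sketch (the 50 MB value is arbitrary):

    // Tables whose statistics report a size under this many bytes are
    // broadcast automatically in joins, with no DataFrame-level hint needed.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)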

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? E.g. df1.join(broadcast(df2)); the broadcast function is in org.apache.spark.sql.functions. On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has statistics of that t
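Spelled out as a self-contained sketch (the tables and the join column "id" are throwaway examples):

    import org.apache.spark.sql.functions.broadcast

    // df2 stands in for the small lookup table; broadcast() marks it for a
    // broadcast join regardless of its statistics.
    val df1 = sqlContext.range(0, 1000000).toDF("id")
    val df2 = sqlContext.range(0, 100).toDF("id")
    val joined = df1.join(broadcast(df2), "id")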

How to force statistics calculation of Dataframe?

2015-11-04 Thread Charmee Patel
Hi, If I have a hive table, analyze table compute statistics will ensure Spark SQL has statistics of that table. When I have a dataframe, is there a way to force Spark to collect statistics? I have a large lookup file and I am trying to avoid a broadcast join by applying a filter beforehand. Thi

Re: Sort Merge Join from the filesystem

2015-11-04 Thread Reynold Xin
It's not supported yet, and I'm not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either. On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > (this is kind of a cross-post from the user list) > > Does Spark support doi

Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
Correct. It's just that with coarse mode we grab the resources up front, so it's either available or not. But using resources on demand, as with a fine-grained mode, just means the potential to starve out an individual job. There is also the sharing of RDDs that coarse gives you, which would need s

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Sean Owen
As usual the signatures and licenses and so on look fine. I continue to get the same test failures on Ubuntu in Java 7/8: - Unpersisting HttpBroadcast on executors only in distributed mode *** FAILED *** But I continue to assume that's specific to tests and/or Ubuntu and/or the build profile, sin

Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Iulian Dragoș
Probably because only coarse-grained mode respects `spark.cores.max` right now. See (and maybe review ;-)) #9027 (sorry for the shameless plug). iulian On Wed, Nov 4, 2015 at 5:05 PM, Timothy Chen wrote: > Hi Chris, > > How does coarse grain mode give
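For context, the two settings in play here, as a sketch (the core count is arbitrary):

    import org.apache.spark.SparkConf

    // Coarse-grained Mesos mode acquires its resources up front; at this point
    // spark.cores.max is only respected in that mode (the subject of #9027).
    val conf = new SparkConf()
      .set("spark.mesos.coarse", "true")
      .set("spark.cores.max", "8")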

Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Timothy Chen
Hi Chris, How does coarse-grained mode give you less starvation in your overloaded cluster? Is it just because it allocates all resources at once (which I think in an overloaded cluster allows fewer things to run at once)? Tim > On Nov 4, 2015, at 4:21 AM, Heller, Chris wrote: > > We’ve been m

Re: Unchecked contribution (JIRA and PR)

2015-11-04 Thread Sergio Ramírez
OK, for me, time is not a problem. I was just worried that there was no movement on those issues. I think they are good contributions. For example, I have found no complex discretization algorithm in MLlib, which is rare. My algorithm, a Spark implementation of the well-known discretizer develo

Re: Build a specific module only

2015-11-04 Thread gsvic
Ok, thank you very much, Ted.

Sort Merge Join from the filesystem

2015-11-04 Thread Alex Nastetsky
(this is kind of a cross-post from the user list) Does Spark support doing a sort merge join on two datasets on the file system that have already been partitioned identically, with the same number of partitions, and sorted within each partition, without needing to repartition/sort them again? This fun
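For reference, 1.5 does ship a shuffle-based sort-merge join behind a flag (on by default); what is being asked for here, recognizing an existing on-disk partitioning and sort order so the exchange can be skipped, is the missing piece. A sketch of the existing flag only:

    // Existing behavior: sort-merge join after a shuffle. This does NOT yet
    // exploit files that are already partitioned and sorted on disk.
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")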

Re: Build a specific module only

2015-11-04 Thread Ted Yu
Please take a look at https://issues.apache.org/jira/browse/SPARK-10883 > On Nov 4, 2015, at 3:27 AM, gsvic wrote: > > Is it possible to build a specific spark module without building the whole > project? > > For example, I am trying to build sql-core project by > > /build/mvn -pl sql/core ins

Re: Codegen In Shuffle

2015-11-04 Thread 牛兆捷
I see. Thanks very much. 2015-11-04 16:25 GMT+08:00 Reynold Xin : > GenerateUnsafeProjection -- projects any internal row data structure > directly into bytes (UnsafeRow). > > > On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > >> Dear all: >> >> Tungsten project has mentioned that they are applying

Looking for the method executors uses to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi, I'd like to write a parquet file from the driver. I could use the HDFS API but I am worried that it won't work on a secure cluster. I assume that the method the executors use to write to HDFS takes care of managing Hadoop security. However, I can't find the place where HDFS write happens in th
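One direction to explore, as a hedged sketch (this is not the executors' actual write path, and the output path is hypothetical): the driver's sc.hadoopConfiguration already carries the cluster's security-related settings, so a FileSystem obtained from it should authenticate the same way the executors do. Producing real Parquet would additionally require a Parquet writer on top of such a stream.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // sc is the driver-side SparkContext; its Hadoop configuration includes
    // the security settings (e.g. Kerberos) picked up from the cluster config.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val out = fs.create(new Path("/tmp/from-driver.bin"))
    out.write("written from the driver".getBytes("UTF-8"))
    out.close()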

Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
We’ve been making use of both. Fine-grained mode makes sense for more ad hoc workloads, and coarse-grained for more job-like loads on a common data set. My preference is fine-grained mode in all cases, but the overhead associated with its startup and the possibility that an overloaded cluster w

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Thanks Owen. Will do it. On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen wrote: > I'm pretty sure that attribute is required. I am not sure what PMML > version the code has been written for but would assume 4.2.1. Feel > free to open a PR to add this version to all the output. > > On Wed, Nov 4, 2015 a

Re: Master build fails ?

2015-11-04 Thread Jacek Laskowski
Hi, It appears it's time to switch to my lovely sbt then! Regards, Jacek -- Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl Follow me at https://twitter.com/jaceklaskowski Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski On Tue, Nov 3, 2015 at 2:58 PM

Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML version the code has been written for but would assume 4.2.1. Feel free to open a PR to add this version to all the output. On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote: > [adding dev] > > On Wed, Nov 4, 2015 at 2:27 PM,
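A quick way to inspect the output, as a sketch (KMeansModel is one of the PMML-exportable models in MLlib 1.4+; the toy data is arbitrary): dump the PMML and check whether the root element carries a version attribute.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Train a throwaway model and print its PMML to inspect the root element.
    val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
    val model = KMeans.train(data, k = 2, maxIterations = 10)
    println(model.toPMML())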

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
[adding dev] On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem wrote: > I just went through all the specifications, and they expect the version > attribute. This should be addressed very soon because if we cannot use the > PMML model without the version attribute, there is no use in generating one > wit

Build a specific module only

2015-11-04 Thread gsvic
Is it possible to build a specific Spark module without building the whole project? For example, I am trying to build the sql-core project with build/mvn -pl sql/core install -DskipTests and I am getting the following error: [error] bad symbolic reference. A signature in WebUI.class refers to term

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Jean-Baptiste Onofré
+1 (non-binding) Just tested with some snippets on my side. Regards JB On 11/04/2015 12:22 AM, Reynold Xin wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC vo

Re: Getting new metrics into /api/v1

2015-11-04 Thread Charles Yeh
Never mind, I didn't realize building the core module separately doesn't update the main jar that the scripts use. Will look into making the history server consistent. On Tue, Nov 3, 2015 at 10:24 PM, Charles Yeh wrote: > Hello, > > I'm trying to get maxCores and memoryPerExecutorMB into /api/v

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data from in-memory binary f
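A sketch against the 1.5 internals (private Catalyst APIs, subject to change): UnsafeProjection is the generated projection in question, turning a generic InternalRow into an UnsafeRow backed by raw bytes.

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
    import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

    // Generate (via codegen) a projection for this schema, then project a
    // generic row into the Tungsten binary format.
    val schema = StructType(Seq(StructField("id", IntegerType), StructField("score", DoubleType)))
    val toUnsafe = UnsafeProjection.create(schema)
    val unsafeRow = toUnsafe(InternalRow(1, 2.0)) // an UnsafeRow over raw bytes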

Codegen In Shuffle

2015-11-04 Thread 牛兆捷
Dear all: The Tungsten project has mentioned that they are applying code generation to speed up the conversion of data from the in-memory binary format to the wire protocol for shuffle. Where can I find the related implementation in the Spark code base? -- Regards, Zhaojie