RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to 
implement it in a generic way.
BTW, I created the JIRA by copying most of the wording from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and I'm not sure if there is a ticket for it. I don't think 
there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system 
that have already been partitioned the same with the same number of partitions 
and sorted within each partition, without needing to repartition/sort them 
again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the 
Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be 
joined with dataset B before B even exists, by pre-processing A with a 
partition/sort.

Thanks.
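
As a rough illustration of the usage being asked for here, the sketch below is written
against the bucketing API that later Spark releases expose (DataFrameWriter.bucketBy/sortBy,
Spark 2.0+); it is not available in the 1.5.x line under discussion, and the paths, table
names, and bucket count are purely illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-smj-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Prepare dataset A ahead of time: bucket and sort it by the join key once.
spark.read.parquet("/data/a")
  .write
  .bucketBy(32, "key")
  .sortBy("key")
  .saveAsTable("a_bucketed")

// Later, when dataset B becomes available, lay it out the same way.
spark.read.parquet("/data/b")
  .write
  .bucketBy(32, "key")
  .sortBy("key")
  .saveAsTable("b_bucketed")

// With matching bucketing and sorting on both sides, the sort merge join
// can proceed without an extra shuffle or sort of either table.
val joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), "key")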



Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint?

e.g.

df1.join(broadcast(df2))

the broadcast function is in org.apache.spark.sql.functions
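
A minimal, self-contained sketch of the hint (assuming a Spark 1.5-style SQLContext; sc is
an existing SparkContext and the data and names are illustrative):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sqlContext = new SQLContext(sc)  // sc: existing SparkContext
import sqlContext.implicits._

// A large fact-like DataFrame and a small dimension-like DataFrame.
val large = sc.parallelize(1 to 1000000).map(i => (i % 100, s"row$i")).toDF("key", "value")
val small = sc.parallelize(0 until 100).map(i => (i, s"ref$i")).toDF("key", "ref")

// broadcast() marks the small side so the planner uses a broadcast join,
// regardless of the statistics it has (or lacks) for either side.
val joined = large.join(broadcast(small), "key")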



On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel  wrote:

> Hi,
>
> If I have a hive table, analyze table compute statistics will ensure Spark
> SQL has statistics of that table. When I have a dataframe, is there a way
> to force spark to collect statistics?
>
> I have a large lookup file, and I am trying to avoid a broadcast join of the
> whole file by applying a filter beforehand. This filtered RDD does not have
> statistics, and so Catalyst does not force a broadcast join. Unfortunately I
> have to use Spark SQL and cannot use the DataFrame API, so I cannot give a
> broadcast hint in the join.
>
> Example is this -
> If filtered RDD is saved as a table and compute stats is run, statistics
> are
>
> test.queryExecution.analyzed.statistics
> org.apache.spark.sql.catalyst.plans.logical.Statistics =
> Statistics(38851747)
>
>
> filtered RDD as is gives
> org.apache.spark.sql.catalyst.plans.logical.Statistics =
> Statistics(58403444019505585)
>
> Forcing the filtered RDD to be materialized (cache/count) causes a different
> issue. Executors go into a deadlock-type state where not a single thread
> runs, for hours. I suspect caching a dataframe + a broadcast join on the same
> dataframe does this. As soon as the cache is removed, the job moves forward.
>
> Is there a way for me to force statistics collection without caching a
> dataframe, so Spark SQL would use it in a broadcast join?
>
> Thanks,
> Charmee
>
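
One way to write down the workaround sketched in the message above (persist the filtered
lookup as a table, compute statistics for it, then join in SQL): the snippet below is only
a sketch, it assumes a HiveContext, and the table and column names are illustrative.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: existing SparkContext

// Persist the filtered lookup data as a table so statistics can be computed for it.
hiveContext.table("lookup").filter("active = 1").write.saveAsTable("lookup_filtered")

// Hive's statistics command populates the size estimate that Catalyst consults.
hiveContext.sql("ANALYZE TABLE lookup_filtered COMPUTE STATISTICS noscan")

// If the resulting size estimate falls under spark.sql.autoBroadcastJoinThreshold
// (10 MB by default), the planner should choose a broadcast join here without
// any DataFrame-level hint.
val joined = hiveContext.sql(
  "SELECT f.* FROM facts f JOIN lookup_filtered l ON f.key = l.key")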


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Egor Pahomov
+1

Things which our infrastructure uses and which I checked:

Dynamic allocation
Spark ODBC server
Reading JSON
Writing Parquet
SQL queries (HiveContext)
Running on CDH


2015-11-04 9:03 GMT-08:00 Sean Owen :

> As usual the signatures and licenses and so on look fine. I continue
> to get the same test failures on Ubuntu in Java 7/8:
>
> - Unpersisting HttpBroadcast on executors only in distributed mode ***
> FAILED ***
>
> But I continue to assume that's specific to tests and/or Ubuntu and/or
> the build profile, since I don't see any evidence of this in other
> builds on Jenkins. It's not a change from previous behavior, though it
> doesn't always happen either.
>
> On Tue, Nov 3, 2015 at 11:22 PM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if
> a
> > majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.5.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > The release fixes 59 known issues in Spark 1.5.1, listed here:
> > http://s.apache.org/spark-1.5.2
> >
> > The tag to be voted on is v1.5.2-rc2:
> > https://github.com/apache/spark/releases/tag/v1.5.2-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > - as version 1.5.2-rc2:
> > https://repository.apache.org/content/repositories/orgapachespark-1153
> > - as version 1.5.2:
> > https://repository.apache.org/content/repositories/orgapachespark-1152
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
> >
> >
> > ===
> > How can I help test this release?
> > ===
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > 
> > What justifies a -1 vote for this release?
> > 
> > -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present
> > in 1.5.1 will not block this release.
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 

*Sincerely yours,*
*Egor Pakhomov*

*AnchorFree*


Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend
HadoopFsRelation and leverage the features of HadoopFsRelation. Is there any
other consideration behind that?


-- 
Best Regards

Jeff Zhang


RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 minute vs. 2 hours seems quite weird; can you provide more information on 
the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org
Subject: RE: dataframe slow down with tungsten turn on

1.5 has critical performance/bug issues; you'd better try the 1.5.1 or 1.5.2 RC 
version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on Spark 1.5 with Tungsten turned off. The 
result is about the same as with Tungsten turned on.
It seems that it is not a Tungsten problem; it is simply that Spark 1.5 is 
slower than Spark 1.4 for this job.
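
For reference, the switch used for this comparison (a sketch; spark.sql.tungsten.enabled is
the 1.5-era flag, and sqlContext is assumed to be an existing SQLContext or HiveContext):

// Disable Tungsten's optimizations for subsequent queries on this context.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")

// Or set it at submit time:
//   spark-submit --conf spark.sql.tungsten.enabled=false ...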

Does anyone have an idea why this happens?
Thanks a lot in advance.

Cheers
Gen


-- Forwarded message --
From: gen tang
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: u...@spark.apache.org
Hi sparkers,

I am using DataFrames to do some large ETL jobs.
More precisely, I create a DataFrame from a Hive table and do some operations, and 
then I save it as JSON.

When I used spark-1.4.1, the whole process was quite fast, about 1 minute. 
However, when I use the same code with spark-1.5.1 (with Tungsten turned on), it 
takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen




RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably 2 reasons:

1.  HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was 
created based on 1.3.

2.  HadoopFsRelation introduces the concept of Partition, which is probably 
not necessary for LibSVMRelation.

But I think it would be easy to change them to extend HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend 
HadoopFsRelation and leverage the features of HadoopFsRelation. Is there any other 
consideration behind that?


--
Best Regards

Jeff Zhang


Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Thanks Hao. I have already made it extend HadoopFsRelation and it works.
Will create a JIRA for that.

Besides that, I noticed that in DataSourceStrategy, Spark builds the physical
plan based on the trait of the BaseRelation via pattern matching (e.g.
CatalystScan, TableScan, HadoopFsRelation). That means the order matters. I
think it is risky, because it means one BaseRelation can't safely extend more
than one of these traits, and there seems to be no place that restricts extending
multiple traits. Maybe these traits need to be cleaned up and reorganized;
otherwise users may hit some weird issues when developing a new DataSource.
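
A simplified, self-contained illustration of that ordering hazard (stand-in traits only,
not the real org.apache.spark.sql.sources ones): whichever case appears first in the
planner's match wins, so a relation that mixes in two scan traits is silently planned
through only one of them.

// Stand-in traits that mimic the shape of the scan traits; purely illustrative.
trait TableScanLike { def scanAll(): Seq[String] }
trait PrunedScanLike { def scanColumns(cols: Seq[String]): Seq[String] }

// A relation that (problematically) mixes in both scan traits.
class MyRelation extends TableScanLike with PrunedScanLike {
  def scanAll(): Seq[String] = Seq("full scan")
  def scanColumns(cols: Seq[String]): Seq[String] =
    Seq("pruned scan of " + cols.mkString(","))
}

object PlannerSketch {
  // The strategy pattern-matches on the relation's type, so the order of the
  // cases decides which scan path is taken, without any warning.
  def plan(relation: AnyRef): Seq[String] = relation match {
    case r: PrunedScanLike => r.scanColumns(Seq("key")) // listed first, so it wins
    case r: TableScanLike  => r.scanAll()               // never reached for MyRelation
    case _                 => Seq("unplannable")
  }

  def main(args: Array[String]): Unit = {
    println(plan(new MyRelation)) // prints: List(pruned scan of key)
  }
}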



On Thu, Nov 5, 2015 at 1:16 PM, Cheng, Hao  wrote:

> Probably 2 reasons:
>
> 1.  HadoopFsRelation was introduced since 1.4, but seems CsvRelation
> was created based on 1.3
>
> 2.  HadoopFsRelation introduces the concept of Partition, which
> probably not necessary for LibSVMRelation.
>
>
>
> But I think it will be easy to change as extending from HadoopFsRelation.
>
>
>
> Hao
>
>
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Thursday, November 5, 2015 10:31 AM
> *To:* dev@spark.apache.org
> *Subject:* Why LibSVMRelation and CsvRelation don't extends
> HadoopFsRelation ?
>
>
>
>
>
> Not sure the reason,  it seems LibSVMRelation and CsvRelation can extends
> HadoopFsRelation and leverage the features from HadoopFsRelation.  Any
> other consideration for that ?
>
>
>
>
>
> --
>
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
I think you’re right; we do leave room for developers to make mistakes while 
implementing a new data source.

Here we assume that a new relation MUST NOT extend more than one of the traits 
CatalystScan, TableScan, PrunedScan, PrunedFilteredScan, etc., otherwise it will 
cause the problem you described. Probably we can add an additional checking / 
reporting rule for the abuse.


From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 1:55 PM
To: Cheng, Hao
Cc: dev@spark.apache.org
Subject: Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

Thanks Hao. I have already made it extend HadoopFsRelation and it works. Will 
create a JIRA for that.

Besides that, I noticed that in DataSourceStrategy, Spark builds the physical plan 
based on the trait of the BaseRelation via pattern matching (e.g. CatalystScan, 
TableScan, HadoopFsRelation). That means the order matters. I think it is risky 
because it means one BaseRelation can't safely extend more than one of these traits, 
and there seems to be no place that restricts extending multiple traits. Maybe these 
traits need to be cleaned up and reorganized; otherwise users may hit some weird 
issues when developing a new DataSource.



On Thu, Nov 5, 2015 at 1:16 PM, Cheng, Hao wrote:
Probably 2 reasons:

1.  HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was 
created based on 1.3.

2.  HadoopFsRelation introduces the concept of Partition, which is probably 
not necessary for LibSVMRelation.

But I think it would be easy to change them to extend HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend 
HadoopFsRelation and leverage the features of HadoopFsRelation. Is there any other 
consideration behind that?


--
Best Regards

Jeff Zhang



--
Best Regards

Jeff Zhang


pyspark with pypy not work for spark-1.5.1

2015-11-04 Thread Chang Ya-Hsuan
Hi all,

I am trying to run PySpark with PyPy. It works when using spark-1.3.1,
but it fails when using spark-1.4.1 and spark-1.5.1.

my pypy version:

$ /usr/bin/pypy --version
Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
[PyPy 2.2.1 with GCC 4.8.4]

works with spark-1.3.1

$ PYSPARK_PYTHON=/usr/bin/pypy ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
[PyPy 2.2.1 with GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a loopback
address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface eth0)
15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
SparkContext available as sc, HiveContext available as sqlContext.
And now for something completely different: ``Armin: "Prolog is a mess.",
CF:
"No, it's very cool!", Armin: "Isn't this what I said?"''
>>>

error message for 1.5.1

$ PYSPARK_PYTHON=/usr/bin/pypy ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
[PyPy 2.2.1 with GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "app_main.py", line 614, in run_it
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py", line 26, in <module>
    from pyspark import accumulators
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py", line 98, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py", line 400, in <module>
    _hijack_namedtuple()
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py", line 378, in _hijack_namedtuple
    _old_namedtuple = _copy_func(collections.namedtuple)
  File "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py", line 376, in _copy_func
    f.__defaults__, f.__closure__)
AttributeError: 'function' object has no attribute '__closure__'
And now for something completely different: ``the traces don't lie''

Is this a known issue? Any suggestions on how to resolve it? Or how can I help to
fix this problem?

Thanks.


Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Thanks Owen. Will do it

On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen  wrote:

> I'm pretty sure that attribute is required. I am not sure what PMML
> version the code has been written for but would assume 4.2.1. Feel
> free to open a PR to add this version to all the output.
>
> On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem  wrote:
> > [adding dev]
> >
> > On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem  wrote:
> >>
> >> I just went through all specifications, and they expect the version
> >> attribute. This should be addressed very soon because if we cannot use
> the
> >> PMML model without the version attribute, there is no use of generating
> one
> >> without it.
> >>
> >> On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino
> >>  wrote:
> >>>
> >>> I used KNIME, which internally uses the org.dmg.pmml library.
> >>>
> >>> On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem 
> wrote:
> 
>  Hi Stefano,
> 
>  Although the intention for my question wasn't as you expected, what
> you
>  say makes sense. The standard[1] for PMML 4.1 specifies that "For
> PMML 4.1
>  the attribute version must have the value 4.1". I'm not sure whether
> that
>  means that other PMML versions do not need that attribute to be set
>  explicitly. I hope someone would answer this.
> 
>  What was the tool you used to load the PMML?
> 
>  [1] http://dmg.org/pmml/v4-1/GeneralStructure.html
> 
>



-- 
Thanks & Regards,

Fazlan Nazeem

*Software Engineer*

*WSO2 Inc*
Mobile : +94772338839
fazl...@wso2.com


Re: Codegen In Shuffle

2015-11-04 Thread 牛兆捷
I see. Thanks very much.

2015-11-04 16:25 GMT+08:00 Reynold Xin :

> GenerateUnsafeProjection -- projects any internal row data structure
> directly into bytes (UnsafeRow).
>
>
> On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷  wrote:
>
>> Dear all:
>>
>> Tungsten project has mentioned that they are applying code generation is
>> to speed up the conversion of data from in-memory binary format to
>> wire-protocol for shuffle.
>>
>> Where can I find the related implementation in spark code-based ?
>>
>> --
>> *Regards,*
>> *Zhaojie*
>>
>>
>


-- 
*Regards,*
*Zhaojie*


Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML
version the code has been written for but would assume 4.2.1. Feel
free to open a PR to add this version to all the output.

On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem  wrote:
> [adding dev]
>
> On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem  wrote:
>>
>> I just went through all specifications, and they expect the version
>> attribute. This should be addressed very soon because if we cannot use the
>> PMML model without the version attribute, there is no use of generating one
>> without it.
>>
>> On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino
>>  wrote:
>>>
>>> I used KNIME, which internally uses the org.dmg.pmml library.
>>>
>>> On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem  wrote:

 Hi Stefano,

 Although the intention for my question wasn't as you expected, what you
 say makes sense. The standard[1] for PMML 4.1 specifies that "For PMML 4.1
 the attribute version must have the value 4.1". I'm not sure whether that
 means that other PMML versions do not need that attribute to be set
 explicitly. I hope someone would answer this.

 What was the tool you used to load the PMML?

 [1] http://dmg.org/pmml/v4-1/GeneralStructure.html


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Master build fails ?

2015-11-04 Thread Jacek Laskowski
Hi,

It appears it's time to switch to my lovely sbt then!

Pozdrawiam,
Jacek

--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Tue, Nov 3, 2015 at 2:58 PM, Jean-Baptiste Onofré  wrote:
> Hi Jacek,
>
> it works fine with mvn: the problem is with sbt.
>
> I suspect a different reactor order in sbt compared to mvn.
>
> Regards
> JB
>
>
> On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>>
>> Hi,
>>
>> Just built the sources using the following command and it worked fine.
>>
>> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
>> -DskipTests clean install
>> ...
>> [INFO]
>> 
>> [INFO] BUILD SUCCESS
>> [INFO]
>> 
>> [INFO] Total time: 14:15 min
>> [INFO] Finished at: 2015-11-03T14:40:40+01:00
>> [INFO] Final Memory: 438M/1972M
>> [INFO]
>> 
>>
>> ➜  spark git:(master) ✗ java -version
>> java version "1.8.0_66"
>> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>>
>> I'm on Mac OS.
>>
>> Pozdrawiam,
>> Jacek
>>
>> --
>> Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
>> Follow me at https://twitter.com/jaceklaskowski
>> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>
>>
>> On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré 
>> wrote:
>>>
>>> Thanks for the update, I used mvn to build but without hive profile.
>>>
>>> Let me try with mvn with the same options as you and sbt also.
>>>
>>> I keep you posted.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/03/2015 12:55 PM, Jeff Zhang wrote:


 I found it is due to SPARK-11073.

 Here's the command I used to build

 build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
 -Psparkr

 On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré wrote:

  Hi Jeff,

  it works for me (with skipping the tests).

  Let me try again, just to be sure.

  Regards
  JB


  On 11/03/2015 11:50 AM, Jeff Zhang wrote:

  Looks like it's due to guava version conflicts, I see both
 guava
  14.0.1
  and 16.0.1 under lib_managed/bundles. Anyone meet this issue
 too ?

  [error]


 /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
  object HashCodes is not a member of package
 com.google.common.hash
  [error] import com.google.common.hash.HashCodes
  [error]^
  [info] Resolving org.apache.commons#commons-math;2.2 ...
  [error]


 /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384:
  not found: value HashCodes
  [error] val cookie =
 HashCodes.fromBytes(secret).toString()
  [error]  ^




  --
  Best Regards

  Jeff Zhang


  --
  Jean-Baptiste Onofré
  jbono...@apache.org 
  http://blog.nanthrax.net
  Talend - http://www.talend.com


 -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  
  For additional commands, e-mail: dev-h...@spark.apache.org
  




 --
 Best Regards

 Jeff Zhang
>>>
>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: Please reply if you use Mesos fine grained mode

2015-11-04 Thread Heller, Chris
We’ve been making use of both. Fine-grained mode makes sense for more ad-hoc 
workloads, and coarse-grained for more job-like loads on a common data set. My 
preference is fine-grained mode in all cases, but the overhead associated 
with its startup and the possibility that an overloaded cluster would be 
starved for resources make coarse-grained mode a reality at the moment.
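
For context, the mode is selected per application through the spark.mesos.coarse property
(a minimal sketch; the master URL and app name are illustrative, and false selects
fine-grained mode on the 1.x line):

import org.apache.spark.{SparkConf, SparkContext}

// Request fine-grained Mesos mode for this application.
val conf = new SparkConf()
  .setAppName("mesos-fine-grained-sketch")
  .setMaster("mesos://zk://zk1:2181/mesos")
  .set("spark.mesos.coarse", "false") // "true" would request coarse-grained mode

val sc = new SparkContext(conf)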

On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote:


If you are using Spark with Mesos fine-grained mode, can you please respond to 
this email explaining why you use it over the coarse-grained mode?

Thanks.





Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Jean-Baptiste Onofré

+1 (non binding)

Just tested with some snippets on my side.

Regards
JB

On 11/04/2015 12:22 AM, Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version
1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.2
[ ] -1 Do not release this package because ...


The release fixes 59 known issues in Spark 1.5.1, listed here:
http://s.apache.org/spark-1.5.2

The tag to be voted on is v1.5.2-rc2:
https://github.com/apache/spark/releases/tag/v1.5.2-rc2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
- as version 1.5.2-rc2:
https://repository.apache.org/content/repositories/orgapachespark-1153
- as version 1.5.2:
https://repository.apache.org/content/repositories/orgapachespark-1152

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.


What justifies a -1 vote for this release?

-1 vote should occur for regressions from Spark 1.5.1. Bugs already
present in 1.5.1 will not block this release.




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure
directly into bytes (UnsafeRow).


On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷  wrote:

> Dear all:
>
> Tungsten project has mentioned that they are applying code generation is
> to speed up the conversion of data from in-memory binary format to
> wire-protocol for shuffle.
>
> Where can I find the related implementation in spark code-based ?
>
> --
> *Regards,*
> *Zhaojie*
>
>


Codegen In Shuffle

2015-11-04 Thread 牛兆捷
Dear all:

The Tungsten project has mentioned that it applies code generation to
speed up the conversion of data from the in-memory binary format to the
wire protocol for shuffle.

Where can I find the related implementation in the Spark code base?

-- 
*Regards,*
*Zhaojie*