Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Chang Ya-Hsuan
Thanks for your quick reply.

I will test several pypy versions and report the result later.

On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen  wrote:

> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
> version to see if that works?
>
> I just checked and it looks like our Jenkins tests are running against
> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
> minimum supported PyPy version is. Would you be interested in helping to
> investigate so that we can update the documentation or produce a fix to
> restore compatibility with earlier PyPy builds?
>
> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan 
> wrote:
>
>> Hi all,
>>
>> I am trying to run pyspark with pypy; it works when using
>> spark-1.3.1 but fails when using spark-1.4.1 and spark-1.5.1.
>>
>> my pypy version:
>>
>> $ /usr/bin/pypy --version
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4]
>>
>> works with spark-1.3.1
>>
>> $ PYSPARK_PYTHON=/usr/bin/pypy
>> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
>> loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
>> eth0)
>> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
>> another address
>> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
>>   /_/
>>
>> Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
>> SparkContext available as sc, HiveContext available as sqlContext.
>> And now for something completely different: ``Armin: "Prolog is a mess.",
>> CF:
>> "No, it's very cool!", Armin: "Isn't this what I said?"''
>> >>>
>>
>> error message for 1.5.1
>>
>> $ PYSPARK_PYTHON=/usr/bin/pypy
>> ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> Traceback (most recent call last):
>>   File "app_main.py", line 72, in run_toplevel
>>   File "app_main.py", line 614, in run_it
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
>> line 30, in 
>> import pyspark
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
>> line 41, in 
>> from pyspark.context import SparkContext
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
>> line 26, in 
>> from pyspark import accumulators
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
>> line 98, in 
>> from pyspark.serializers import read_int, PickleSerializer
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>> line 400, in 
>> _hijack_namedtuple()
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>> line 378, in _hijack_namedtuple
>> _old_namedtuple = _copy_func(collections.namedtuple)
>>   File
>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>> line 376, in _copy_func
>> f.__defaults__, f.__closure__)
>> AttributeError: 'function' object has no attribute '__closure__'
>> And now for something completely different: ``the traces don't lie''
>>
>> is this a known issue? any suggestion to resolve it? or how can I help to
>> fix this problem?
>>
>> Thanks.
>>
>
>


-- 
-- 張雅軒
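
For context on the AttributeError in the traceback above: PySpark's serializers
module copies collections.namedtuple by rebuilding a function object from its
attributes. A minimal sketch of that pattern is below (an approximation of
_copy_func, not the exact Spark source). CPython 2.7 and newer PyPy builds run
it fine, while PyPy 2.2.1 apparently does not expose __closure__ on the
built-in namedtuple function, which produces exactly the error shown.

    # Approximate sketch of PySpark's _copy_func (pyspark/serializers.py);
    # the real implementation may differ slightly.  It rebuilds a function
    # object from its parts so the original namedtuple can be kept around
    # before PySpark "hijacks" collections.namedtuple.
    import collections
    import types

    def _copy_func(f):
        return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                                  f.__defaults__, f.__closure__)

    # Works on CPython 2.7 and PyPy >= 2.3; on PyPy 2.2.1 the namedtuple
    # function has no __closure__ attribute, hence the AttributeError above.
    _old_namedtuple = _copy_func(collections.namedtuple)
    Point = _old_namedtuple("Point", ["x", "y"])
    print(Point(1, 2))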


Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
version to see if that works?

I just checked and it looks like our Jenkins tests are running against PyPy
2.5.1, so that version is known to work. I'm not sure what the actual
minimum supported PyPy version is. Would you be interested in helping to
investigate so that we can update the documentation or produce a fix to
restore compatibility with earlier PyPy builds?

On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan  wrote:

> Hi all,
>
> I am trying to run pyspark with pypy, and it is work when using
> spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1
>
> my pypy version:
>
> $ /usr/bin/pypy --version
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4]
>
> works with spark-1.3.1
>
> $ PYSPARK_PYTHON=/usr/bin/pypy
> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a loopback
> address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface eth0)
> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
>   /_/
>
> Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
> SparkContext available as sc, HiveContext available as sqlContext.
> And now for something completely different: ``Armin: "Prolog is a mess.",
> CF:
> "No, it's very cool!", Armin: "Isn't this what I said?"''
> >>>
>
> error message for 1.5.1
>
> $ PYSPARK_PYTHON=/usr/bin/pypy
> ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "app_main.py", line 614, in run_it
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
> line 30, in 
> import pyspark
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
> line 41, in 
> from pyspark.context import SparkContext
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
> line 26, in 
> from pyspark import accumulators
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
> line 98, in 
> from pyspark.serializers import read_int, PickleSerializer
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 400, in 
> _hijack_namedtuple()
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 378, in _hijack_namedtuple
> _old_namedtuple = _copy_func(collections.namedtuple)
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 376, in _copy_func
> f.__defaults__, f.__closure__)
> AttributeError: 'function' object has no attribute '__closure__'
> And now for something completely different: ``the traces don't lie''
>
> is this a known issue? any suggestion to resolve it? or how can I help to
> fix this problem?
>
> Thanks.
>


Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Chang Ya-Hsuan
I've tested the following PyPy versions against spark-1.5.1:

  pypy-2.2.1
  pypy-2.3
  pypy-2.3.1
  pypy-2.4.0
  pypy-2.5.0
  pypy-2.5.1
  pypy-2.6.0
  pypy-2.6.1

I run

$ PYSPARK_PYTHON=/path/to/pypy-xx.xx/bin/pypy
/path/to/spark-1.5.1/bin/pyspark

and only pypy-2.2.1 failed.

Any suggestions for running more advanced tests?
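
In the meantime, here is a small probe that can be run under each interpreter
without a full Spark install. It mirrors the attribute access PySpark 1.4/1.5
performs when copying collections.namedtuple (a sketch, not an official
compatibility test):

    # Check whether this interpreter exposes the function attributes that
    # PySpark's _copy_func needs when copying collections.namedtuple.
    import collections
    import sys
    import types

    def namedtuple_is_copyable():
        f = collections.namedtuple
        try:
            types.FunctionType(f.__code__, f.__globals__, f.__name__,
                               f.__defaults__, f.__closure__)
            return True
        except AttributeError:
            return False

    print(sys.version)
    print("copyable: %s" % namedtuple_is_copyable())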

On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan  wrote:

> Thanks for your quickly reply.
>
> I will test several pypy versions and report the result later.
>
> On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen  wrote:
>
>> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
>> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
>> version to see if that works?
>>
>> I just checked and it looks like our Jenkins tests are running against
>> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
>> minimum supported PyPy version is. Would you be interested in helping to
>> investigate so that we can update the documentation or produce a fix to
>> restore compatibility with earlier PyPy builds?
>>
>> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan 
>> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to run pyspark with pypy, and it is work when using
>>> spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1
>>>
>>> my pypy version:
>>>
>>> $ /usr/bin/pypy --version
>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>> [PyPy 2.2.1 with GCC 4.8.4]
>>>
>>> works with spark-1.3.1
>>>
>>> $ PYSPARK_PYTHON=/usr/bin/pypy
>>> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
>>> loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
>>> eth0)
>>> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
>>> another address
>>> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
>>>   /_/
>>>
>>> Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
>>> SparkContext available as sc, HiveContext available as sqlContext.
>>> And now for something completely different: ``Armin: "Prolog is a
>>> mess.", CF:
>>> "No, it's very cool!", Armin: "Isn't this what I said?"''
>>> >>>
>>>
>>> error message for 1.5.1
>>>
>>> $ PYSPARK_PYTHON=/usr/bin/pypy
>>> ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> Traceback (most recent call last):
>>>   File "app_main.py", line 72, in run_toplevel
>>>   File "app_main.py", line 614, in run_it
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
>>> line 30, in 
>>> import pyspark
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
>>> line 41, in 
>>> from pyspark.context import SparkContext
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
>>> line 26, in 
>>> from pyspark import accumulators
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
>>> line 98, in 
>>> from pyspark.serializers import read_int, PickleSerializer
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>> line 400, in 
>>> _hijack_namedtuple()
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>> line 378, in _hijack_namedtuple
>>> _old_namedtuple = _copy_func(collections.namedtuple)
>>>   File
>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>> line 376, in _copy_func
>>> f.__defaults__, f.__closure__)
>>> AttributeError: 'function' object has no attribute '__closure__'
>>> And now for something completely different: ``the traces don't lie''
>>>
>>> is this a known issue? any suggestion to resolve it? or how can I help
>>> to fix this problem?
>>>
>>> Thanks.
>>>
>>
>>
>
>
> --
> -- 張雅軒
>



-- 
-- 張雅軒


Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-05 Thread Sjoerd Mulder
Hi Reynold,

I had version 2.6.1 in my project which was provided by the fine folks
from spring-boot-dependencies.

Now have overridden it to 2.7.8 :)

Sjoerd

2015-11-01 8:22 GMT+01:00 Reynold Xin :

> Thanks for reporting it, Sjoerd. You might have a different version of
> Janino brought in from somewhere else.
>
> This should fix your problem: https://github.com/apache/spark/pull/9372
>
> Can you give it a try?
>
>
>
> On Tue, Oct 27, 2015 at 9:12 PM, Sjoerd Mulder 
> wrote:
>
>> No, the job doesn't actually fail, but since our tests are generating all
>> these stacktraces I have disabled the Tungsten mode just to be sure (and
>> don't have a gazillion stacktraces in production).
>>
>> 2015-10-27 20:59 GMT+01:00 Josh Rosen :
>>
>>> Hi Sjoerd,
>>>
>>> Did your job actually *fail* or did it just generate many spurious
>>> exceptions? While the stacktrace that you posted does indicate a bug, I
>>> don't think that it should have stopped query execution because Spark
>>> should have fallen back to an interpreted code path (note the "Failed
>>> to generate ordering, fallback to interpreted" in the error message).
>>>
>>> On Tue, Oct 27, 2015 at 12:56 PM Sjoerd Mulder 
>>> wrote:
>>>
 I have disabled it because of it started generating ERROR's when
 upgrading from Spark 1.4 to 1.5.1

 2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed
 to generate ordering, fallback to interpreted
 java.util.concurrent.ExecutionException: java.lang.Exception: failed to
 compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9:
 Invalid character input "@" (character code 64)

 public SpecificOrdering
 generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
   return new SpecificOrdering(expr);
 }

 class SpecificOrdering extends
 org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {

   private org.apache.spark.sql.catalyst.expressions.Expression[]
 expressions;



   public
 SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[]
 expr) {
 expressions = expr;

   }

   @Override
   public int compare(InternalRow a, InternalRow b) {
 InternalRow i = null;  // Holds current row being evaluated.

 i = a;
 boolean isNullA2;
 long primitiveA3;
 {
   /* input[2, LongType] */

   boolean isNull0 = i.isNullAt(2);
   long primitive1 = isNull0 ? -1L : (i.getLong(2));

   isNullA2 = isNull0;
   primitiveA3 = primitive1;
 }
 i = b;
 boolean isNullB4;
 long primitiveB5;
 {
   /* input[2, LongType] */

   boolean isNull0 = i.isNullAt(2);
   long primitive1 = isNull0 ? -1L : (i.getLong(2));

   isNullB4 = isNull0;
   primitiveB5 = primitive1;
 }
 if (isNullA2 && isNullB4) {
   // Nothing
 } else if (isNullA2) {
   return 1;
 } else if (isNullB4) {
   return -1;
 } else {
   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 <
 primitiveB5 ? -1 : 0);
   if (comp != 0) {
 return -comp;
   }
 }

 return 0;
   }
 }

 at
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
 at
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
 at
 org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
 at
 org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
 at
 org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
 at
 org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
 at
 org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at
 org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
 at
 org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at
 org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.compile(CodeGenerator.scala:362)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:139)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:37)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:425)
 at
 

Fwd: dataframe slow down with tungsten turn on

2015-11-05 Thread gen tang
-- Forwarded message --
From: gen tang 
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" 


Hi,

My application is as follows:
1. create a dataframe from a hive table
2. transform the dataframe to an RDD of JSON and do some aggregations on it (in
fact, I use pyspark, so it is an RDD of dicts)
3. retransform the RDD of JSON to a dataframe and cache it (triggered by count)
4. join several dataframes created by the above steps
5. save the final dataframe as JSON (via the dataframe write API)

There are a lot of stages; the other stages take about the same time under the
two versions of Spark. However, the final step (save as JSON) is 1 minute vs. 2
hours. I think it is the write to HDFS that causes the slowness of the final
stage, but I don't know why...

In fact, I made a mistake about the version of Spark that I used. The Spark
that runs faster is built from the source code of Spark 1.4.1. The Spark that
runs slower is built from the source code of Spark 1.5.2, from 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen
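
For concreteness, a rough PySpark outline of the five steps above (table names,
column names, the aggregation, and the output path are all placeholders, not
the real job):

    # Illustrative outline of the ETL flow described above; the names and the
    # aggregation are stand-ins.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext, Row

    sc = SparkContext(appName="etl-outline")
    sqlContext = HiveContext(sc)

    # 1. create a dataframe from a hive table
    df = sqlContext.table("some_hive_table")

    # 2. transform to an RDD of dicts and aggregate
    dicts = df.rdd.map(lambda row: row.asDict())
    aggregated = (dicts.map(lambda d: (d["key"], d))
                       .reduceByKey(lambda a, b: a))   # stand-in aggregation

    # 3. back to a dataframe and cache it (count() materializes the cache)
    df2 = sqlContext.createDataFrame(
        aggregated.values().map(lambda d: Row(**d)))
    df2.cache()
    df2.count()

    # 4. join with other dataframes built the same way ...
    joined = df2.join(df2.select("key"), "key")        # stand-in join

    # 5. ... and save the final dataframe as JSON
    joined.write.json("hdfs:///tmp/etl_output")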


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao  wrote:

> BTW, 1 min V.S. 2 Hours, seems quite weird, can you provide more
> information on the ETL work?
>
>
>
> *From:* Cheng, Hao [mailto:hao.ch...@intel.com]
> *Sent:* Thursday, November 5, 2015 12:56 PM
> *To:* gen tang; dev@spark.apache.org
> *Subject:* RE: dataframe slow down with tungsten turn on
>
>
>
> 1.5 has critical performance / bug issues, you’d better try 1.5.1 or
> 1.5.2rc version.
>
>
>
> *From:* gen tang [mailto:gen.tan...@gmail.com ]
> *Sent:* Thursday, November 5, 2015 12:43 PM
> *To:* dev@spark.apache.org
> *Subject:* Fwd: dataframe slow down with tungsten turn on
>
>
>
> Hi,
>
>
>
> In fact, I tested the same code with spark 1.5 with tungsten turning off.
> The result is quite the same as tungsten turning on.
>
> It seems that it is not the problem of tungsten, it is simply that spark
> 1.5 is slower than spark 1.4.
>
>
>
> Is there any idea about why it happens?
>
> Thanks a lot in advance
>
>
>
> Cheers
>
> Gen
>
>
>
>
>
> -- Forwarded message --
> From: *gen tang* 
> Date: Wed, Nov 4, 2015 at 3:54 PM
> Subject: dataframe slow down with tungsten turn on
> To: "u...@spark.apache.org" 
>
> Hi sparkers,
>
>
>
> I am using dataframe to do some large ETL jobs.
>
> More precisely, I create dataframe from HIVE table and do some operations.
> And then I save it as json.
>
>
>
> When I used spark-1.4.1, the whole process is quite fast, about 1 mins.
> However, when I use the same code with spark-1.5.1(with tungsten turn on),
> it takes a about 2 hours to finish the same job.
>
>
>
> I checked the detail of tasks, almost all the time is consumed by
> computation.
>
> Any idea about why this happens?
>
>
>
> Thanks a lot in advance for your help.
>
>
>
> Cheers
>
> Gen
>
>
>
>
>


Recommended change to core-site.xml template

2015-11-05 Thread Christian
We ended up reading and writing to S3 a ton in our Spark jobs.
For this to work, we ended up having to add s3a, and s3 key/secret pairs.
We also had to add fs.hdfs.impl to get these things to work.

I thought maybe I'd share what we did and it might be worth adding these to
the spark conf for out of the box functionality with S3.

We created:
ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml

We changed the contents from the original, adding in the following:

  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
  </property>

  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>

  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>

  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>{{aws_secret_access_key}}</value>
  </property>

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>{{aws_secret_access_key}}</value>
  </property>

  <property>
    <name>fs.s3a.awsAccessKeyId</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3a.awsSecretAccessKey</name>
    <value>{{aws_secret_access_key}}</value>
  </property>

This change makes spark on ec2 work out of the box for us. It took us
several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
version 2.

Best Regards,
Christian
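
As a quick smoke test of the template, a minimal PySpark read against an s3n://
path (the bucket and prefix below are made up; the credentials come from the
entries above):

    # With the core-site.xml entries in place, s3n:// paths should resolve
    # without setting hadoopConfiguration by hand.  Bucket/prefix are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="s3-smoke-test")
    lines = sc.textFile("s3n://my-bucket/some/prefix/part-*")
    print(lines.count())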


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
Thanks for sharing this, Christian.

What build of Spark are you using? If I understand correctly, if you are
using Spark built against Hadoop 2.6+ then additional configs alone won't
help because additional libraries also need to be installed
.

Nick

On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:

> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret pairs.
> We also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding these
> to the spark conf for out of the box functionality with S3.
>
> We created:
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents form the original, adding in the following:
>
>   
> fs.file.impl
> org.apache.hadoop.fs.LocalFileSystem
>   
>
>   
> fs.hdfs.impl
> org.apache.hadoop.hdfs.DistributedFileSystem
>   
>
>   
> fs.s3.impl
> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>   
>
>   
> fs.s3.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3n.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3n.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3a.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3a.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
> This change makes spark on ec2 work out of the box for us. It took us
> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
> version 2.
>
> Best Regards,
> Christian
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Shivaram Venkataraman
Thanks for investigating this. The right place to add these is the
core-site.xml template we have at
https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/spark/conf/core-site.xml
and/or 
https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/ephemeral-hdfs/conf/core-site.xml

Feel free to open a PR against the amplab/spark-ec2 repository for this.

Thanks
Shivaram

On Thu, Nov 5, 2015 at 8:25 AM, Christian  wrote:
> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret pairs. We
> also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding these to
> the spark conf for out of the box functionality with S3.
>
> We created:
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents form the original, adding in the following:
>
>   
> fs.file.impl
> org.apache.hadoop.fs.LocalFileSystem
>   
>
>   
> fs.hdfs.impl
> org.apache.hadoop.hdfs.DistributedFileSystem
>   
>
>   
> fs.s3.impl
> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>   
>
>   
> fs.s3.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3n.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3n.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3a.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3a.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
> This change makes spark on ec2 work out of the box for us. It took us
> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
> version 2.
>
> Best Regards,
> Christian




Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
> I am using both 1.4.1 and 1.5.1.

That's the Spark version. I'm wondering what version of Hadoop your Spark
is built against.

For example, when you download Spark
 you have to select from a number
of packages (under "Choose a package type"), and each is built against a
different version of Hadoop. When Spark is built against Hadoop 2.6+, from
my understanding, you need to install additional libraries
 to access S3. When Spark
is built against Hadoop 2.4 or earlier, you don't need to do this.

I'm confirming that this is what is happening in your case.

Nick

On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:

> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of the
> new feature for instance-profile which greatly helps with this as well.
> Without the instance-profile, we got it working by copying a
> .aws/credentials file up to each node. We could easily automate that
> through the templates.
>
> I don't need any additional libraries. We just need to change the
> core-site.xml
>
> -Christian
>
> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for sharing this, Christian.
>>
>> What build of Spark are you using? If I understand correctly, if you are
>> using Spark built against Hadoop 2.6+ then additional configs alone won't
>> help because additional libraries also need to be installed
>> .
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>>
>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>> For this to work, we ended up having to add s3a, and s3 key/secret
>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>
>>> I thought maybe I'd share what we did and it might be worth adding these
>>> to the spark conf for out of the box functionality with S3.
>>>
>>> We created:
>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>
>>> We changed the contents form the original, adding in the following:
>>>
>>>   
>>> fs.file.impl
>>> org.apache.hadoop.fs.LocalFileSystem
>>>   
>>>
>>>   
>>> fs.hdfs.impl
>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>   
>>>
>>>   
>>> fs.s3.impl
>>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>>   
>>>
>>>   
>>> fs.s3.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3n.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3n.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3a.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3a.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>> This change makes spark on ec2 work out of the box for us. It took us
>>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>>> version 2.
>>>
>>> Best Regards,
>>> Christian
>>>
>>
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Christian
I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of the
new feature for instance-profile which greatly helps with this as well.
Without the instance-profile, we got it working by copying a
.aws/credentials file up to each node. We could easily automate that
through the templates.

I don't need any additional libraries. We just need to change the
core-site.xml

-Christian

On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas  wrote:

> Thanks for sharing this, Christian.
>
> What build of Spark are you using? If I understand correctly, if you are
> using Spark built against Hadoop 2.6+ then additional configs alone won't
> help because additional libraries also need to be installed
> .
>
> Nick
>
> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>
>> We ended up reading and writing to S3 a ton in our Spark jobs.
>> For this to work, we ended up having to add s3a, and s3 key/secret pairs.
>> We also had to add fs.hdfs.impl to get these things to work.
>>
>> I thought maybe I'd share what we did and it might be worth adding these
>> to the spark conf for out of the box functionality with S3.
>>
>> We created:
>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>
>> We changed the contents form the original, adding in the following:
>>
>>   
>> fs.file.impl
>> org.apache.hadoop.fs.LocalFileSystem
>>   
>>
>>   
>> fs.hdfs.impl
>> org.apache.hadoop.hdfs.DistributedFileSystem
>>   
>>
>>   
>> fs.s3.impl
>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>   
>>
>>   
>> fs.s3.awsAccessKeyId
>> {{aws_access_key_id}}
>>   
>>
>>   
>> fs.s3.awsSecretAccessKey
>> {{aws_secret_access_key}}
>>   
>>
>>   
>> fs.s3n.awsAccessKeyId
>> {{aws_access_key_id}}
>>   
>>
>>   
>> fs.s3n.awsSecretAccessKey
>> {{aws_secret_access_key}}
>>   
>>
>>   
>> fs.s3a.awsAccessKeyId
>> {{aws_access_key_id}}
>>   
>>
>>   
>> fs.s3a.awsSecretAccessKey
>> {{aws_secret_access_key}}
>>   
>>
>> This change makes spark on ec2 work out of the box for us. It took us
>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>> version 2.
>>
>> Best Regards,
>> Christian
>>
>


Re: Master build fails ?

2015-11-05 Thread Steve Loughran
SBT/Ivy pulls in the most recent version of a JAR, whereas Maven pulls in the
"closest" one, where closest means the lowest distance/depth from the root of
the dependency tree.


> On 5 Nov 2015, at 18:53, Marcelo Vanzin  wrote:
> 
> Seems like it's an sbt issue, not a maven one, so "dependency:tree"
> might not help. Still, the command line would be helpful. I use sbt
> and don't see this.
> 
> On Thu, Nov 5, 2015 at 10:44 AM, Marcelo Vanzin  wrote:
>> Hi Jeff,
>> 
>> On Tue, Nov 3, 2015 at 2:50 AM, Jeff Zhang  wrote:
>>> Looks like it's due to guava version conflicts, I see both guava 14.0.1 and
>>> 16.0.1 under lib_managed/bundles. Anyone meet this issue too ?
>> 
>> What command line are you using to build? Can you run "mvn
>> dependency:tree" (with all the other options you're using) to figure
>> out where guava 16 is coming from? Locally I only see version 14,
>> compiling against hadoop 2.5.0.
>> 
>> --
>> Marcelo
> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org





Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
Does anyone know how to get something similar to "mvn dependency:tree" from sbt?

mvn dependency:tree with hadoop 2.6.0 does not show any instances of guava 16...

On Thu, Nov 5, 2015 at 11:37 AM, Ted Yu  wrote:
> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
> -Dhadoop.version=2.6.0 -DskipTests assembly
>
> The above command fails on Mac.
>
> build/sbt -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -Pkinesis-asl
> -DskipTests assembly
>
> The above command, used by Jenkins, passes.
> That's why the build error wasn't caught.
>
> FYI
>
> On Thu, Nov 5, 2015 at 11:07 AM, Dilip Biswal  wrote:
>>
>> Hello Ted,
>>
>> Thanks for your response.
>>
>> Here is the command i used :
>>
>> build/sbt clean
>> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
>> -Dhadoop.version=2.6.0 -DskipTests assembly
>>
>> I am building on CentOS and on master branch.
>>
>> One other thing, I was able to build fine with the above command up until
>> recently. I think I started
>> to have problems after SPARK-11073, where the HashCodes import was added.
>>
>> Regards,
>> Dilip Biswal
>> Tel: 408-463-4980
>> dbis...@us.ibm.com
>>
>>
>>
>> From:Ted Yu 
>> To:Dilip Biswal/Oakland/IBM@IBMUS
>> Cc:Jean-Baptiste Onofré , "dev@spark.apache.org"
>> 
>> Date:11/05/2015 10:46 AM
>> Subject:Re: Master build fails ?
>> 
>>
>>
>>
>> Dilip:
>> Can you give the command you used ?
>>
>> Which release were you building ?
>> What OS did you build on ?
>>
>> Cheers
>>
>> On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
>> Hello,
>>
>> I am getting the same build error about not being able tofind
>> com.google.common.hash.HashCodes.
>>
>>
>> Is there a solution to this ?
>>
>> Regards,
>> Dilip Biswal
>> Tel: 408-463-4980
>> dbis...@us.ibm.com
>>
>>
>>
>> From:Jean-Baptiste Onofré 
>> To:Ted Yu 
>> Cc:"dev@spark.apache.org" 
>> Date:11/03/2015 07:20 AM
>> Subject:Re: Master build fails ?
>> 
>>
>>
>>
>> Hi Ted,
>>
>> thanks for the update. The build with sbt is in progress on my box.
>>
>> Regards
>> JB
>>
>> On 11/03/2015 03:31 PM, Ted Yu wrote:
>> > Interesting, Sbt builds were not all failing:
>> >
>> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>> >
>> > FYI
>> >
>> > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré > > > wrote:
>>
>> >
>> > Hi Jacek,
>> >
>> > it works fine with mvn: the problem is with sbt.
>> >
>> > I suspect a different reactor order in sbt compare to mvn.
>> >
>> > Regards
>> > JB
>> >
>> > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>> >
>> > Hi,
>> >
>> > Just built the sources using the following command and it worked
>> > fine.
>> >
>> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>> > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
>> > -DskipTests clean install
>> > ...
>> > [INFO]
>> >
>> > 
>> > [INFO] BUILD SUCCESS
>> > [INFO]
>> >
>> > 
>> > [INFO] Total time: 14:15 min
>> > [INFO] Finished at: 2015-11-03T14:40:40+01:00
>> > [INFO] Final Memory: 438M/1972M
>> > [INFO]
>> >
>> > 
>> >
>> > ➜  spark git:(master) ✗ java -version
>> > java version "1.8.0_66"
>> > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
>> > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>> >
>> > I'm on Mac OS.
>> >
>> > Pozdrawiam,
>> > Jacek
>> >
>> > --
>> > Jacek Laskowski |
>> http://blog.japila.pl|
>> > http://blog.jaceklaskowski.pl
>>
>> > Follow me at https://twitter.com/jaceklaskowski
>> > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>
>> >
>> >
>> > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
>> > > wrote:
>> >
>> > Thanks for the update, I used mvn to build but without hive
>> > profile.
>> >
>> > Let me try with mvn with the same options as you and sbt
>> > also.
>> >
>> > I keep you posted.
>> >
>> > Regards
>> > JB
>> >
>> > On 11/03/2015 12:55 PM, Jeff Zhang wrote:
>> >
>> >
>> > I found it is due to SPARK-11073.
>> >
>> > Here's the command I used to build
>> >
>> > build/sbt clean compile -Pyarn 

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-05 Thread Nicholas Chammas
-0

The spark-ec2 version is still set to 1.5.1
.

Nick

On Wed, Nov 4, 2015 at 8:20 PM Egor Pahomov  wrote:

> +1
>
> Things, which our infrastructure use and I checked:
>
> Dynamic allocation
> Spark ODBC server
> Reading json
> Writing parquet
> SQL quires (hive context)
> Running on CDH
>
>
> 2015-11-04 9:03 GMT-08:00 Sean Owen :
>
>> As usual the signatures and licenses and so on look fine. I continue
>> to get the same test failures on Ubuntu in Java 7/8:
>>
>> - Unpersisting HttpBroadcast on executors only in distributed mode ***
>> FAILED ***
>>
>> But I continue to assume that's specific to tests and/or Ubuntu and/or
>> the build profile, since I don't see any evidence of this in other
>> builds on Jenkins. It's not a change from previous behavior, though it
>> doesn't always happen either.
>>
>> On Tue, Nov 3, 2015 at 11:22 PM, Reynold Xin  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes
>> if a
>> > majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.5.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The release fixes 59 known issues in Spark 1.5.1, listed here:
>> > http://s.apache.org/spark-1.5.2
>> >
>> > The tag to be voted on is v1.5.2-rc2:
>> > https://github.com/apache/spark/releases/tag/v1.5.2-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > - as version 1.5.2-rc2:
>> > https://repository.apache.org/content/repositories/orgapachespark-1153
>> > - as version 1.5.2:
>> > https://repository.apache.org/content/repositories/orgapachespark-1152
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
>> >
>> >
>> > ===
>> > How can I help test this release?
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > 
>> > What justifies a -1 vote for this release?
>> > 
>> > -1 vote should occur for regressions from Spark 1.5.1. Bugs already
>> present
>> > in 1.5.1 will not block this release.
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
>
> Sincerely yours,
> Egor Pakhomov
>
> AnchorFree
>
>


Re: Master build fails ?

2015-11-05 Thread Ted Yu
build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
-Dhadoop.version=2.6.0 -DskipTests assembly

The above command fails on Mac.

build/sbt -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -Pkinesis-asl
-DskipTests assembly

The above command, used by Jenkins, passes.
That's why the build error wasn't caught.

FYI

On Thu, Nov 5, 2015 at 11:07 AM, Dilip Biswal  wrote:

> Hello Ted,
>
> Thanks for your response.
>
> Here is the command i used :
>
> build/sbt clean
> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
> -Dhadoop.version=2.6.0 -DskipTests assembly
>
> I am building on CentOS and on master branch.
>
> One other thing, i was able to build fine with the above command up until
> recently. I think i have stared
> to have problem after SPARK-11073 where the HashCodes import was added.
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:Ted Yu 
> To:Dilip Biswal/Oakland/IBM@IBMUS
> Cc:Jean-Baptiste Onofré , "dev@spark.apache.org"
> 
> Date:11/05/2015 10:46 AM
> Subject:Re: Master build fails ?
> --
>
>
>
> Dilip:
> Can you give the command you used ?
>
> Which release were you building ?
> What OS did you build on ?
>
> Cheers
>
> On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal <*dbis...@us.ibm.com*
> > wrote:
> Hello,
>
> I am getting the same build error about not being able tofind
> com.google.common.hash.HashCodes.
>
>
> Is there a solution to this ?
>
> Regards,
> Dilip Biswal
> Tel: *408-463-4980* <408-463-4980>
> *dbis...@us.ibm.com* 
>
>
>
> From:Jean-Baptiste Onofré <*j...@nanthrax.net* >
> To:Ted Yu <*yuzhih...@gmail.com* >
> Cc:"*dev@spark.apache.org* " <
> *dev@spark.apache.org* >
> Date:11/03/2015 07:20 AM
> Subject:Re: Master build fails ?
> --
>
>
>
> Hi Ted,
>
> thanks for the update. The build with sbt is in progress on my box.
>
> Regards
> JB
>
> On 11/03/2015 03:31 PM, Ted Yu wrote:
> > Interesting, Sbt builds were not all failing:
> >
> > *https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/*
> 
> >
> > FYI
> >
> > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré <*j...@nanthrax.net*
> 
> > <*mailto:j...@nanthrax.net* >> wrote:
>
> >
> > Hi Jacek,
> >
> > it works fine with mvn: the problem is with sbt.
> >
> > I suspect a different reactor order in sbt compare to mvn.
> >
> > Regards
> > JB
> >
> > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
> >
> > Hi,
> >
> > Just built the sources using the following command and it worked
> > fine.
> >
> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> > -DskipTests clean install
> > ...
> > [INFO]
> >
> 
> > [INFO] BUILD SUCCESS
> > [INFO]
> >
> 
> > [INFO] Total time: 14:15 min
> > [INFO] Finished at: 2015-11-03T14:40:40+01:00
> > [INFO] Final Memory: 438M/1972M
> > [INFO]
> >
> 
> >
> > ➜  spark git:(master) ✗ java -version
> > java version "1.8.0_66"
> > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> >
> > I'm on Mac OS.
> >
> > Pozdrawiam,
> > Jacek
> >
> > --
> > Jacek Laskowski |
> *http://blog.japila.pl* |
> > *http://blog.jaceklaskowski.pl* 
>
> > Follow me at *https://twitter.com/jaceklaskowski*
> 
> > Upvote at
> *http://stackoverflow.com/users/1305344/jacek-laskowski*
> 
>
> >
> >
> > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> > <*j...@nanthrax.net* <*mailto:j...@nanthrax.net*
> >> wrote:
> >
> > Thanks for the update, I used mvn to build but without hive
> > profile.
> >
> > Let me try with mvn with the same options as you and sbt
> also.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/03/2015 12:55 PM, Jeff Zhang wrote:
> >
> >
> > I found it is due to SPARK-11073.
> >
> > Here's the command I used to build
> >
> 

Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
Man that command is slow. Anyway, it seems guava 16 is being brought
transitively by curator 2.6.0 which should have been overridden by the
explicit dependency on curator 2.4.0, but apparently, as Steve
mentioned, sbt/ivy decided to break things, so I'll be adding some
exclusions.

On Thu, Nov 5, 2015 at 11:55 AM, Marcelo Vanzin  wrote:
> Answering my own question: "dependency-graph"
>
> On Thu, Nov 5, 2015 at 11:44 AM, Marcelo Vanzin  wrote:
>> Does anyone know how to get something similar to "mvn dependency:tree" from 
>> sbt?
>>
>> mvn dependency:tree with hadoop 2.6.0 does not show any instances of guava 
>> 16...
>>
>> On Thu, Nov 5, 2015 at 11:37 AM, Ted Yu  wrote:
>>> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
>>> -Dhadoop.version=2.6.0 -DskipTests assembly
>>>
>>> The above command fails on Mac.
>>>
>>> build/sbt -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -Pkinesis-asl
>>> -DskipTests assembly
>>>
>>> The above command, used by Jenkins, passes.
>>> That's why the build error wasn't caught.
>>>
>>> FYI
>>>
>>> On Thu, Nov 5, 2015 at 11:07 AM, Dilip Biswal  wrote:

 Hello Ted,

 Thanks for your response.

 Here is the command i used :

 build/sbt clean
 build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
 -Dhadoop.version=2.6.0 -DskipTests assembly

 I am building on CentOS and on master branch.

 One other thing, i was able to build fine with the above command up until
 recently. I think i have stared
 to have problem after SPARK-11073 where the HashCodes import was added.

 Regards,
 Dilip Biswal
 Tel: 408-463-4980
 dbis...@us.ibm.com



 From:Ted Yu 
 To:Dilip Biswal/Oakland/IBM@IBMUS
 Cc:Jean-Baptiste Onofré , "dev@spark.apache.org"
 
 Date:11/05/2015 10:46 AM
 Subject:Re: Master build fails ?
 



 Dilip:
 Can you give the command you used ?

 Which release were you building ?
 What OS did you build on ?

 Cheers

 On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
 Hello,

 I am getting the same build error about not being able tofind
 com.google.common.hash.HashCodes.


 Is there a solution to this ?

 Regards,
 Dilip Biswal
 Tel: 408-463-4980
 dbis...@us.ibm.com



 From:Jean-Baptiste Onofré 
 To:Ted Yu 
 Cc:"dev@spark.apache.org" 
 Date:11/03/2015 07:20 AM
 Subject:Re: Master build fails ?
 



 Hi Ted,

 thanks for the update. The build with sbt is in progress on my box.

 Regards
 JB

 On 11/03/2015 03:31 PM, Ted Yu wrote:
 > Interesting, Sbt builds were not all failing:
 >
 > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
 >
 > FYI
 >
 > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  > wrote:

 >
 > Hi Jacek,
 >
 > it works fine with mvn: the problem is with sbt.
 >
 > I suspect a different reactor order in sbt compare to mvn.
 >
 > Regards
 > JB
 >
 > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
 >
 > Hi,
 >
 > Just built the sources using the following command and it worked
 > fine.
 >
 > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
 > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
 > -DskipTests clean install
 > ...
 > [INFO]
 >
 > 
 > [INFO] BUILD SUCCESS
 > [INFO]
 >
 > 
 > [INFO] Total time: 14:15 min
 > [INFO] Finished at: 2015-11-03T14:40:40+01:00
 > [INFO] Final Memory: 438M/1972M
 > [INFO]
 >
 > 
 >
 > ➜  spark git:(master) ✗ java -version
 > java version "1.8.0_66"
 > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
 > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
 >
 > I'm on Mac OS.
 >
 > Pozdrawiam,
 > Jacek
 >
 > --
 > Jacek Laskowski |
 http://blog.japila.pl|
 > http://blog.jaceklaskowski.pl

 > Follow me 

Re: [BUILD SYSTEM] quick jenkins downtime, november 5th 7am

2015-11-05 Thread shane knapp
well, i forgot to put this on my calendar and didn't get around to
getting it done this morning.  :)

anyways, i'll be shooting for tomorrow (friday) morning instead.

shane

On Mon, Nov 2, 2015 at 9:55 AM, shane knapp  wrote:
> i'd like to take jenkins down briefly thursday morning to install some
> plugin updates.
>
> this will hopefully be short (~1hr), but could easily become longer as
> the jenkins plugin ecosystem is fragile and updates like this are
> known to cause things to explode.  the only reason why i'm
> contemplating this, is i'm having some issues with the git plugin on
> new github pull request builder builds.
>
> i'll send updates as things progress.
>
> thanks,
>
> shane




Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
You could try running PySpark's own unit tests. Try ./python/run-tests
--help for instructions.
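
For example, something along these lines could drive the test suite under a
specific PyPy build (the flags vary by Spark version, so check the --help
output first; --python-executables and the paths below are assumptions):

    # Run PySpark's own tests under a chosen interpreter from a Spark checkout.
    # The --python-executables flag is assumed; confirm it via --help as above.
    import subprocess

    subprocess.check_call(
        ["./python/run-tests",
         "--python-executables=/path/to/pypy-2.3/bin/pypy"],
        cwd="/path/to/spark-1.5.1",
    )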

On Thu, Nov 5, 2015 at 12:31 AM Chang Ya-Hsuan  wrote:

> I've test on following pypy version against to spark-1.5.1
>
>   pypy-2.2.1
>   pypy-2.3
>   pypy-2.3.1
>   pypy-2.4.0
>   pypy-2.5.0
>   pypy-2.5.1
>   pypy-2.6.0
>   pypy-2.6.1
>
> I run
>
> $ PYSPARK_PYTHON=/path/to/pypy-xx.xx/bin/pypy
> /path/to/spark-1.5.1/bin/pyspark
>
> and only pypy-2.2.1 failed.
>
> Any suggestion to run advanced test?
>
> On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan  wrote:
>
>> Thanks for your quickly reply.
>>
>> I will test several pypy versions and report the result later.
>>
>> On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen  wrote:
>>
>>> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
>>> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
>>> version to see if that works?
>>>
>>> I just checked and it looks like our Jenkins tests are running against
>>> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
>>> minimum supported PyPy version is. Would you be interested in helping to
>>> investigate so that we can update the documentation or produce a fix to
>>> restore compatibility with earlier PyPy builds?
>>>
>>> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan 
>>> wrote:
>>>
 Hi all,

 I am trying to run pyspark with pypy, and it is work when using
 spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1

 my pypy version:

 $ /usr/bin/pypy --version
 Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
 [PyPy 2.2.1 with GCC 4.8.4]

 works with spark-1.3.1

 $ PYSPARK_PYTHON=/usr/bin/pypy
 ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
 Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
 [PyPy 2.2.1 with GCC 4.8.4] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
 loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
 eth0)
 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
 another address
 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
   /_/

 Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
 SparkContext available as sc, HiveContext available as sqlContext.
 And now for something completely different: ``Armin: "Prolog is a
 mess.", CF:
 "No, it's very cool!", Armin: "Isn't this what I said?"''
 >>>

 error message for 1.5.1

 $ PYSPARK_PYTHON=/usr/bin/pypy
 ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
 Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
 [PyPy 2.2.1 with GCC 4.8.4] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 Traceback (most recent call last):
   File "app_main.py", line 72, in run_toplevel
   File "app_main.py", line 614, in run_it
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
 line 30, in 
 import pyspark
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
 line 41, in 
 from pyspark.context import SparkContext
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
 line 26, in 
 from pyspark import accumulators
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
 line 98, in 
 from pyspark.serializers import read_int, PickleSerializer
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
 line 400, in 
 _hijack_namedtuple()
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
 line 378, in _hijack_namedtuple
 _old_namedtuple = _copy_func(collections.namedtuple)
   File
 "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
 line 376, in _copy_func
 f.__defaults__, f.__closure__)
 AttributeError: 'function' object has no attribute '__closure__'
 And now for something completely different: ``the traces don't lie''

 is this a known issue? any suggestion to resolve it? or how can I help
 to fix this problem?

 Thanks.

>>>
>>>
>>
>>
>> --
>> -- 張雅軒
>>
>
>
>
> --
> -- 張雅軒
>


Re: Master build fails ?

2015-11-05 Thread Dilip Biswal
Hello,

I am getting the same build error about not being able to find 
com.google.common.hash.HashCodes.

Is there a solution to this ?

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Jean-Baptiste Onofré 
To: Ted Yu 
Cc: "dev@spark.apache.org" 
Date:   11/03/2015 07:20 AM
Subject:Re: Master build fails ?



Hi Ted,

thanks for the update. The build with sbt is in progress on my box.

Regards
JB

On 11/03/2015 03:31 PM, Ted Yu wrote:
> Interesting, Sbt builds were not all failing:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>
> FYI
>
> On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  > wrote:
>
> Hi Jacek,
>
> it works fine with mvn: the problem is with sbt.
>
> I suspect a different reactor order in sbt compare to mvn.
>
> Regards
> JB
>
> On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>
> Hi,
>
> Just built the sources using the following command and it worked
> fine.
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> -DskipTests clean install
> ...
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 14:15 min
> [INFO] Finished at: 2015-11-03T14:40:40+01:00
> [INFO] Final Memory: 438M/1972M
> [INFO]
> 
>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>
> I'm on Mac OS.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | http://blog.japila.pl |
> http://blog.jaceklaskowski.pl
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> > wrote:
>
> Thanks for the update, I used mvn to build but without hive
> profile.
>
> Let me try with mvn with the same options as you and sbt 
also.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/03/2015 12:55 PM, Jeff Zhang wrote:
>
>
> I found it is due to SPARK-11073.
>
> Here's the command I used to build
>
> build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive
> -Phive-thriftserver
> -Psparkr
>
> On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré
> 
> >> 
wrote:
>
>   Hi Jeff,
>
>   it works for me (with skipping the tests).
>
>   Let me try again, just to be sure.
>
>   Regards
>   JB
>
>
>   On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>
>   Looks like it's due to guava version
> conflicts, I see both guava
>   14.0.1
>   and 16.0.1 under lib_managed/bundles. Anyone
> meet this issue too ?
>
>   [error]
>
> 
/Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
>   object HashCodes is not a member of package
> com.google.common.hash
>   [error] import 
com.google.common.hash.HashCodes
>   [error]^
>   [info] Resolving
> org.apache.commons#commons-math;2.2 ...
>   [error]
>
> 
/Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384:
>   not found: value HashCodes
>   [error] val cookie =
> HashCodes.fromBytes(secret).toString()
>   [error]  ^
>
>
>
>
>   --
>   Best Regards
>
>   Jeff Zhang
>
>
>   --
>   Jean-Baptiste Onofré
> jbono...@apache.org 
> >
> http://blog.nanthrax.net
>   Talend - 

Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
Seems like it's an sbt issue, not a maven one, so "dependency:tree"
might not help. Still, the command line would be helpful. I use sbt
and don't see this.

On Thu, Nov 5, 2015 at 10:44 AM, Marcelo Vanzin  wrote:
> Hi Jeff,
>
> On Tue, Nov 3, 2015 at 2:50 AM, Jeff Zhang  wrote:
>> Looks like it's due to guava version conflicts, I see both guava 14.0.1 and
>> 16.0.1 under lib_managed/bundles. Anyone meet this issue too ?
>
> What command line are you using to build? Can you run "mvn
> dependency:tree" (with all the other options you're using) to figure
> out where guava 16 is coming from? Locally I only see version 14,
> compiling against hadoop 2.5.0.
>
> --
> Marcelo



-- 
Marcelo




Re: Spark 1.6 Release Schedule

2015-11-05 Thread Michael Armbrust
Sorry for the delay due to traveling...

The branch has been cut.  At this point anything that we want to go into
Spark 1.6 will need to be cherry-picked.  Please be cautious when doing so,
and contact me if you are uncertain.

Michael

On Sun, Nov 1, 2015 at 4:16 AM, Sean Owen  wrote:

> I like the idea, but I think there's already a lot of triage backlog. Can
> we more concretely address this now and during the next two weeks?
>
> 1.6.0 stats from JIRA:
>
> 344 issues targeted at 1.6.0, of which
>   253 are from committers, of which
> 215 are improvements/other, of which
>5 are blockers
> 38 are bugs, of which
>4 are blockers
>   11 are critical
>
> Tip: It's really easy to manage saved queries for this and other things
> with the free JIRA Client (http://almworks.com/jiraclient/overview.html)
> that now works with Java 8.
>
> It still looks like a lot for a point where 1.6.0 is supposed to be being
> tested in theory. Lots of (most?) things that were said to be done for
> 1.6.0 for several months aren't going to be, and that still surprises me as
> a software development practice.
>
> Well, life is busy and chaotic out here in OSS land. I'd still like to
> push even more on lightweight triage and release planning, centering around
> Target Version, if only to make visible what's happening with intention and
> reality:
>
> 1. Any JIRAs that seem to have been targeted at 1.6.0 by a non-committer
> are untargeted, as they shouldn't be to begin with
>
> 2. This week, maintainers and interested parties review all JIRAs targeted
> at 1.6.0 and untarget/retarget accordingly
>
> 3. Start of next week (the final days before an RC), non-Blocker non-bugs
> untargeted, or in a few cases pushed to 1.6.1 or beyond
>
> 4. After next week, non-Blocker and non-Critical bugs are pushed, as the
> RC is then late.
>
> 5. No release candidate until no Blockers are open.
>
> 6. (Repeat 1 and 2 more regularly through the development period for 1.7
> instead of at the end.)
>
> On Sat, Oct 31, 2015 at 11:25 AM, Michael Armbrust  > wrote:
>
>> Hey All,
>>
>> Just a friendly reminder that today (October 31st) is the scheduled code
>> freeze for Spark 1.6.  Since a lot of developers were busy with the Spark
>> Summit last week I'm going to delay cutting the branch until Monday,
>> November 2nd.  After that point, we'll package a release for testing and
>> then go into the normal triage process where bugs are prioritized and some
>> smaller features are allowed in on a case by case basis (if they are very
>> low risk/additive/feature flagged/etc).
>>
>> As a reminder, release window dates are always maintained on the wiki and
>> are updated after each release according to our 3 month release cadence:
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>
>> Thanks!
>>
>> Michael
>>
>>
>


Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
If your data came from RDDs (i.e. not a file-system-based data source) and
you don't want to cache, then no.
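
For reference, the DataFrame-side broadcast hint discussed in the quoted exchange
below looks roughly like this in Scala (a minimal sketch against the 1.5-era API;
the DataFrame and column names are illustrative):

import org.apache.spark.sql.functions.broadcast

// Hint Catalyst to broadcast the (already filtered) lookup side of the join,
// regardless of its estimated statistics.
val joined = factsDF.join(broadcast(lookupDF), factsDF("student_id") === lookupDF("student_id"))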


On Wed, Nov 4, 2015 at 3:51 PM, Charmee Patel  wrote:

> Due to other reasons we are using spark sql, not dataframe api. I saw that
> broadcast hint is only available on dataframe api.
>
> On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin  wrote:
>
>> Can you use the broadcast hint?
>>
>> e.g.
>>
>> df1.join(broadcast(df2))
>>
>> the broadcast function is in org.apache.spark.sql.functions
>>
>>
>>
>> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> If I have a hive table, analyze table compute statistics will ensure
>>> Spark SQL has statistics of that table. When I have a dataframe, is there a
>>> way to force spark to collect statistics?
>>>
>>> I have a large lookup file and I am trying to avoid a broadcast join by
>>> applying a filter before hand. This filtered RDD does not have statistics
>>> and so catalyst does not force a broadcast join. Unfortunately I have to
>>> use spark sql and cannot use dataframe api so cannot give a broadcast hint
>>> in the join.
>>>
>>> Example is this -
>>> If filtered RDD is saved as a table and compute stats is run, statistics
>>> are
>>>
>>> test.queryExecution.analyzed.statistics
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(38851747)
>>>
>>>
>>> filtered RDD as is gives
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(58403444019505585)
>>>
>>> The filtered RDD forced to be materialized (cache/count) causes a different
>>> issue: executors go into a deadlock-type state where not a single thread
>>> runs, for hours. I suspect caching a dataframe + a broadcast join on the same
>>> dataframe does this. As soon as the cache is removed, the job moves forward.
>>>
>>> Is there a way for me to force statistics collection without caching
>>> a dataframe, so Spark SQL would use it in a broadcast join?
>>>
>>> Thanks,
>>> Charmee
>>>
>>
>>


Re: Master build fails ?

2015-11-05 Thread Ted Yu
Dilip:
Can you give the command you used ?

Which release were you building ?
What OS did you build on ?

Cheers

On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:

> Hello,
>
> I am getting the same build error about not being able to find
> com.google.common.hash.HashCodes.
>
> Is there a solution to this ?
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:Jean-Baptiste Onofré 
> To:Ted Yu 
> Cc:"dev@spark.apache.org" 
> Date:11/03/2015 07:20 AM
> Subject:Re: Master build fails ?
> --
>
>
>
> Hi Ted,
>
> thanks for the update. The build with sbt is in progress on my box.
>
> Regards
> JB
>
> On 11/03/2015 03:31 PM, Ted Yu wrote:
> > Interesting, Sbt builds were not all failing:
> >
> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
> >
> > FYI
> >
> > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  > >> wrote:
>
> >
> > Hi Jacek,
> >
> > it works fine with mvn: the problem is with sbt.
> >
> > I suspect a different reactor order in sbt compared to mvn.
> >
> > Regards
> > JB
> >
> > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
> >
> > Hi,
> >
> > Just built the sources using the following command and it worked
> > fine.
> >
> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> > -DskipTests clean install
> > ...
> > [INFO]
> >
> 
> > [INFO] BUILD SUCCESS
> > [INFO]
> >
> 
> > [INFO] Total time: 14:15 min
> > [INFO] Finished at: 2015-11-03T14:40:40+01:00
> > [INFO] Final Memory: 438M/1972M
> > [INFO]
> >
> 
> >
> > ➜  spark git:(master) ✗ java -version
> > java version "1.8.0_66"
> > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> >
> > I'm on Mac OS.
> >
> > Pozdrawiam,
> > Jacek
> >
> > --
> > Jacek Laskowski |
> http://blog.japila.pl|
> > http://blog.jaceklaskowski.pl
>
> > Follow me at https://twitter.com/jaceklaskowski
> > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
> >
> >
> > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> > >>
> wrote:
> >
> > Thanks for the update, I used mvn to build but without hive
> > profile.
> >
> > Let me try with mvn with the same options as you and sbt
> also.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/03/2015 12:55 PM, Jeff Zhang wrote:
> >
> >
> > I found it is due to SPARK-11073.
> >
> > Here's the command I used to build
> >
> > build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive
> > -Phive-thriftserver
> > -Psparkr
> >
> > On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré
> >  >
> > <
> mailto:j...@nanthrax.net 
> >
> >   Hi Jeff,
> >
> >   it works for me (with skipping the tests).
> >
> >   Let me try again, just to be sure.
> >
> >   Regards
> >   JB
> >
> >
> >   On 11/03/2015 11:50 AM, Jeff Zhang wrote:
> >
> >   Looks like it's due to guava version
> > conflicts, I see both guava
> >   14.0.1
> >   and 16.0.1 under lib_managed/bundles. Anyone
> > meet this issue too ?
> >
> >   [error]
> >
> >
> /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
> >   object HashCodes is not a member of package
> > com.google.common.hash
> >   [error] import com.google.common.hash.HashCodes
> >   [error]^
> >   [info] Resolving
> > org.apache.commons#commons-math;2.2 ...
> >   [error]
> >
> >
> /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384:
> >

Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
Hi Jeff,

On Tue, Nov 3, 2015 at 2:50 AM, Jeff Zhang  wrote:
> Looks like it's due to guava version conflicts, I see both guava 14.0.1 and
> 16.0.1 under lib_managed/bundles. Anyone meet this issue too ?

What command line are you using to build? Can you run "mvn
dependency:tree" (with all the other options you're using) to figure
out where guava 16 is coming from? Locally I only see version 14,
compiling against hadoop 2.5.0.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Master build fails ?

2015-11-05 Thread Dilip Biswal
Hello Ted,

Thanks for your response.

Here is the command i used :

build/sbt clean
build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
-Dhadoop.version=2.6.0 -DskipTests assembly

I am building on CentOS and on master branch.

One other thing: I was able to build fine with the above command until
recently. I think I started to have problems after SPARK-11073, where the
HashCodes import was added.

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Ted Yu 
To: Dilip Biswal/Oakland/IBM@IBMUS
Cc: Jean-Baptiste Onofré , "dev@spark.apache.org" 

Date:   11/05/2015 10:46 AM
Subject:Re: Master build fails ?



Dilip:
Can you give the command you used ?

Which release were you building ?
What OS did you build on ?

Cheers

On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
Hello,

I am getting the same build error about not being able to find 
com.google.common.hash.HashCodes.

Is there a solution to this ?

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:Jean-Baptiste Onofré 
To:Ted Yu 
Cc:"dev@spark.apache.org" 
Date:11/03/2015 07:20 AM
Subject:Re: Master build fails ?



Hi Ted,

thanks for the update. The build with sbt is in progress on my box.

Regards
JB

On 11/03/2015 03:31 PM, Ted Yu wrote:
> Interesting, Sbt builds were not all failing:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>
> FYI
>
> On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  > wrote:

>
> Hi Jacek,
>
> it works fine with mvn: the problem is with sbt.
>
> I suspect a different reactor order in sbt compared to mvn.
>
> Regards
> JB
>
> On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>
> Hi,
>
> Just built the sources using the following command and it worked
> fine.
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> -DskipTests clean install
> ...
> [INFO]
> 

> [INFO] BUILD SUCCESS
> [INFO]
> 

> [INFO] Total time: 14:15 min
> [INFO] Finished at: 2015-11-03T14:40:40+01:00
> [INFO] Final Memory: 438M/1972M
> [INFO]
> 

>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>
> I'm on Mac OS.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | 
http://blog.japila.pl|
> http://blog.jaceklaskowski.pl

> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

>
>
> On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> > wrote:
>
> Thanks for the update, I used mvn to build but without hive
> profile.
>
> Let me try with mvn with the same options as you and sbt 
also.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/03/2015 12:55 PM, Jeff Zhang wrote:
>
>
> I found it is due to SPARK-11073.
>
> Here's the command I used to build
>
> build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive
> -Phive-thriftserver
> -Psparkr
>
> On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré
> 
> >> wrote:

>
>   Hi Jeff,
>
>   it works for me (with skipping the tests).
>
>   Let me try again, just to be sure.
>
>   Regards
>   JB
>
>
>   On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>
>   Looks like it's due to guava version
> conflicts, I see both guava
>   14.0.1
>   and 16.0.1 under lib_managed/bundles. Anyone
> meet this issue too ?
>
>   [error]
>
>
> /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
>   object HashCodes is not a member of package
> com.google.common.hash
>   [error] 

Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
Answering my own question: "dependency-graph"
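
(That task is provided by the sbt-dependency-graph plugin, which the Spark sbt
build appears to include, given that the command works. A rough sketch of using
it, assuming a 0.8.x plugin version; exact task names can vary between plugin
versions:)

// project/plugins.sbt, only needed if the plugin is not already on the build
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")

// then, from the sbt shell:
//   core/dependencyTree                                (full tree, similar to mvn dependency:tree)
//   core/whatDependsOn com.google.guava guava 16.0.1   (trace where a specific artifact comes from)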

On Thu, Nov 5, 2015 at 11:44 AM, Marcelo Vanzin  wrote:
> Does anyone know how to get something similar to "mvn dependency:tree" from 
> sbt?
>
> mvn dependency:tree with hadoop 2.6.0 does not show any instances of guava 
> 16...
>
> On Thu, Nov 5, 2015 at 11:37 AM, Ted Yu  wrote:
>> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
>> -Dhadoop.version=2.6.0 -DskipTests assembly
>>
>> The above command fails on Mac.
>>
>> build/sbt -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -Pkinesis-asl
>> -DskipTests assembly
>>
>> The above command, used by Jenkins, passes.
>> That's why the build error wasn't caught.
>>
>> FYI
>>
>> On Thu, Nov 5, 2015 at 11:07 AM, Dilip Biswal  wrote:
>>>
>>> Hello Ted,
>>>
>>> Thanks for your response.
>>>
>>> Here is the command i used :
>>>
>>> build/sbt clean
>>> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
>>> -Dhadoop.version=2.6.0 -DskipTests assembly
>>>
>>> I am building on CentOS and on master branch.
>>>
>>> One other thing: I was able to build fine with the above command until
>>> recently. I think I started to have problems after SPARK-11073, where the
>>> HashCodes import was added.
>>>
>>> Regards,
>>> Dilip Biswal
>>> Tel: 408-463-4980
>>> dbis...@us.ibm.com
>>>
>>>
>>>
>>> From:Ted Yu 
>>> To:Dilip Biswal/Oakland/IBM@IBMUS
>>> Cc:Jean-Baptiste Onofré , "dev@spark.apache.org"
>>> 
>>> Date:11/05/2015 10:46 AM
>>> Subject:Re: Master build fails ?
>>> 
>>>
>>>
>>>
>>> Dilip:
>>> Can you give the command you used ?
>>>
>>> Which release were you building ?
>>> What OS did you build on ?
>>>
>>> Cheers
>>>
>>> On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
>>> Hello,
>>>
>>> I am getting the same build error about not being able to find
>>> com.google.common.hash.HashCodes.
>>>
>>>
>>> Is there a solution to this ?
>>>
>>> Regards,
>>> Dilip Biswal
>>> Tel: 408-463-4980
>>> dbis...@us.ibm.com
>>>
>>>
>>>
>>> From:Jean-Baptiste Onofré 
>>> To:Ted Yu 
>>> Cc:"dev@spark.apache.org" 
>>> Date:11/03/2015 07:20 AM
>>> Subject:Re: Master build fails ?
>>> 
>>>
>>>
>>>
>>> Hi Ted,
>>>
>>> thanks for the update. The build with sbt is in progress on my box.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/03/2015 03:31 PM, Ted Yu wrote:
>>> > Interesting, Sbt builds were not all failing:
>>> >
>>> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>>> >
>>> > FYI
>>> >
>>> > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré >> > > wrote:
>>>
>>> >
>>> > Hi Jacek,
>>> >
>>> > it works fine with mvn: the problem is with sbt.
>>> >
>>> > I suspect a different reactor order in sbt compared to mvn.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>>> >
>>> > Hi,
>>> >
>>> > Just built the sources using the following command and it worked
>>> > fine.
>>> >
>>> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>>> > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
>>> > -DskipTests clean install
>>> > ...
>>> > [INFO]
>>> >
>>> > 
>>> > [INFO] BUILD SUCCESS
>>> > [INFO]
>>> >
>>> > 
>>> > [INFO] Total time: 14:15 min
>>> > [INFO] Finished at: 2015-11-03T14:40:40+01:00
>>> > [INFO] Final Memory: 438M/1972M
>>> > [INFO]
>>> >
>>> > 
>>> >
>>> > ➜  spark git:(master) ✗ java -version
>>> > java version "1.8.0_66"
>>> > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
>>> > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>>> >
>>> > I'm on Mac OS.
>>> >
>>> > Pozdrawiam,
>>> > Jacek
>>> >
>>> > --
>>> > Jacek Laskowski |
>>> http://blog.japila.pl|
>>> > http://blog.jaceklaskowski.pl
>>>
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>>
>>> >
>>> >
>>> > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
>>> > > wrote:
>>> >
>>> > Thanks for the update, I used mvn to build but without hive
>>> > profile.
>>> >
>>> > Let me try with mvn with the same options as you and sbt
>>> > also.
>>> >
>>> > I keep you posted.
>>> >

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Jörn Franke
Would it be possible to use views to address some of your requirements?

Alternatively it might be better to parse it yourself. There are open source
libraries for that, if you really need a complete SQL parser. Do you want to do
it on subqueries?

> On 05 Nov 2015, at 23:34, Yana Kadiyska  wrote:
> 
> Hi folks, not sure if this belongs to dev or user list..sending to dev as it 
> seems a bit convoluted.
> 
> I have a UI in which we allow users to write ad-hoc queries against a (very 
> large, partitioned) table. I would like to analyze the queries prior to 
> execution for two purposes:
> 
> 1. Reject under-constrained queries (i.e. there is a field predicate that I 
> want to make sure is always present)
> 2. Augment the query with additional predicates (e.g if the user asks for a 
> student_id I also want to push a constraint on another field)
> 
> I could parse the sql string before passing to spark but obviously spark 
> already does this anyway. Can someone give me general direction on how to do 
> this (if possible).
> 
> Something like
> 
> myDF = sql("user_sql_query")
> myDF.queryExecution.logical  //here examine the filters provided by user, 
> reject if underconstrained, push new filters as needed (via withNewChildren?)
>  
> at this point with some luck I'd have a new LogicalPlan -- what is the proper 
> way to create an execution plan on top of this new Plan? Im looking at this 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
>  but this method is restricted to the package. I'd really prefer to hook into 
> this as early as possible and still let spark run the plan optimizations as 
> usual.
> 
> Any guidance or pointers much appreciated.


Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Reynold Xin
You can hack around this by constructing logical plans yourself and then
creating a DataFrame in order to execute them. Note that this is all
depending on internals of the framework and can break when Spark upgrades.
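
For anyone looking for a concrete starting point, here is a minimal sketch of the
inspect-and-augment approach in Scala, written against the 1.5-era API. It reads
Catalyst internals (logical.Filter), which are not a stable interface, and the
helper name and example predicate are purely illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.Filter

// Reject under-constrained queries, then append an extra predicate.
def checkAndAugment(df: DataFrame, requiredCol: String, extraPredicate: String): DataFrame = {
  // Collect the conditions of all Filter nodes in the analyzed logical plan.
  val conditions = df.queryExecution.analyzed.collect { case f: Filter => f.condition }
  // The user's query must constrain the required column somewhere.
  val constrained = conditions.exists(_.references.exists(_.name == requiredCol))
  require(constrained, s"query must constrain column '$requiredCol'")
  // Append the extra predicate; Catalyst still runs its usual analysis and
  // optimizations (including partition pruning) on the resulting plan.
  df.filter(extraPredicate)
}

// e.g. checkAndAugment(sqlContext.sql(userQuery), "student_id", "teacherName = 'Smith'")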


On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska 
wrote:

> I don't think a view would help -- in the case of under-constraining, I
> want to make sure that the user is constraining a column (e.g. I want to
> restrict them to querying a single partition at a time but I don't care
> which one)...a view per partition value is not practical due to the fairly
> high cardinality...
>
> In the case of predicate augmentation, the additional predicate depends on
> the value the user is providing e.g. my data is partitioned under
> teacherName but the end users don't have this information...So if they ask
> for student_id="1234" I'd like to add "teacherName='Smith'" based on a
> mapping that is not surfaced to the user (sorry for the contrived
> example)...But I don't think I can do this with a view. A join will produce
> the right answer but is counter-productive as my goal is to minimize the
> partitions being processed.
>
> I can parse the query myself -- I was not fond of this solution as I'd go
> sql string to parse tree back to augmented sql string only to have spark
> repeat the first part of the exercise... but will do if need be. And yes,
> I'd have to be able to process sub-queries too...
>
> On Thu, Nov 5, 2015 at 5:50 PM, Jörn Franke  wrote:
>
>> Would it be possible to use views to address some of your requirements?
>>
>> Alternatively it might be better to parse it yourself. There are open
>> source libraries for that, if you really need a complete SQL parser. Do you
>> want to do it on subqueries?
>>
>> On 05 Nov 2015, at 23:34, Yana Kadiyska  wrote:
>>
>> Hi folks, not sure if this belongs to dev or user list..sending to dev as
>> it seems a bit convoluted.
>>
>> I have a UI in which we allow users to write ad-hoc queries against a
>> (very large, partitioned) table. I would like to analyze the queries prior
>> to execution for two purposes:
>>
>> 1. Reject under-constrained queries (i.e. there is a field predicate that
>> I want to make sure is always present)
>> 2. Augment the query with additional predicates (e.g if the user asks for
>> a student_id I also want to push a constraint on another field)
>>
>> I could parse the sql string before passing to spark but obviously spark
>> already does this anyway. Can someone give me general direction on how to
>> do this (if possible).
>>
>> Something like
>>
>> myDF = sql("user_sql_query")
>> myDF.queryExecution.logical  //here examine the filters provided by
>> user, reject if underconstrained, push new filters as needed (via
>> withNewChildren?)
>>
>> at this point with some luck I'd have a new LogicalPlan -- what is the
>> proper way to create an execution plan on top of this new Plan? Im looking
>> at this
>> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
>> but this method is restricted to the package. I'd really prefer to hook
>> into this as early as possible and still let spark run the plan
>> optimizations as usual.
>>
>> Any guidance or pointers much appreciated.
>>
>>
>


Re: State of the Build

2015-11-05 Thread Ted Yu
See previous discussion:
http://search-hadoop.com/m/q3RTtPnPnzwOhBr

FYI

On Thu, Nov 5, 2015 at 4:30 PM, Stephen Boesch  wrote:

> Yes. The current dev/change-scala-version.sh mutates (/pollutes) the build
> environment by updating the pom.xml in each of the subprojects. If you were
> able to come up with a structure that avoids that approach it would be an
> improvement.
>
> 2015-11-05 15:38 GMT-08:00 Jakob Odersky :
>
>> Hi everyone,
>> in the process of learning Spark, I wanted to get an overview of the
>> interaction between all of its sub-projects. I therefore decided to have a
>> look at the build setup and its dependency management.
>> Since I am a lot more comfortable using sbt than maven, I decided to try
>> to port the maven configuration to sbt (with the help of automated tools).
>> This led me to a couple of observations and questions on the build system
>> design:
>>
>> First, currently, there are two build systems, maven and sbt. Is there a
>> preferred tool (or future direction to one)?
>>
>> Second, the sbt build also uses maven "profiles" requiring the use of
>> specific commandline parameters when starting sbt. Furthermore, since it
>> relies on maven poms, dependencies to the scala binary version (_2.xx) are
>> hardcoded and require running an external script when switching versions.
>> Sbt could leverage built-in constructs to support cross-compilation and
>> emulate profiles with configurations and new build targets. This would
>> remove external state from the build (in that no extra steps need to be
>> performed in a particular order to generate artifacts for a new
>> configuration) and therefore improve stability and build reproducibility
>> (maybe even build performance). I was wondering if implementing such
>> functionality for the sbt build would be welcome?
>>
>> thanks,
>> --Jakob
>>
>
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Christian
I created the cluster with the following:

--hadoop-major-version=2
--spark-version=1.4.1

from: spark-1.5.1-bin-hadoop1

Are you saying there might be different behavior if I download
spark-1.5.1-hadoop-2.6 and create my cluster?

On Thu, Nov 5, 2015 at 1:28 PM, Christian  wrote:

> Spark 1.5.1-hadoop1
>
> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> > I am using both 1.4.1 and 1.5.1.
>>
>> That's the Spark version. I'm wondering what version of Hadoop your Spark
>> is built against.
>>
>> For example, when you download Spark
>>  you have to select from a
>> number of packages (under "Choose a package type"), and each is built
>> against a different version of Hadoop. When Spark is built against Hadoop
>> 2.6+, from my understanding, you need to install additional libraries
>>  to access S3. When
>> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>>
>> I'm confirming that this is what is happening in your case.
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:
>>
>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>>> the new feature for instance-profile which greatly helps with this as well.
>>> Without the instance-profile, we got it working by copying a
>>> .aws/credentials file up to each node. We could easily automate that
>>> through the templates.
>>>
>>> I don't need any additional libraries. We just need to change the
>>> core-site.xml
>>>
>>> -Christian
>>>
>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Thanks for sharing this, Christian.

 What build of Spark are you using? If I understand correctly, if you
 are using Spark built against Hadoop 2.6+ then additional configs alone
 won't help because additional libraries also need to be installed
 .

 Nick

 On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:

> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret
> pairs. We also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding
> these to the spark conf for out of the box functionality with S3.
>
> We created:
>
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents from the original, adding in the following:
>
>   <property>
>     <name>fs.file.impl</name>
>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.hdfs.impl</name>
>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.impl</name>
>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
> This change makes spark on ec2 work out of the box for us. It took us
> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
> version 2.
>
> Best Regards,
> Christian
>

>>>
>
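
(As an aside, the same credentials can also be supplied at runtime on the
SparkContext's Hadoop configuration, which avoids editing the template at all.
A hedged Scala sketch; the environment-variable names are illustrative and
assumed to be set:)

// Equivalent runtime configuration of the s3n credentials from the core-site.xml above.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))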


State of the Build

2015-11-05 Thread Jakob Odersky
Hi everyone,
in the process of learning Spark, I wanted to get an overview of the
interaction between all of its sub-projects. I therefore decided to have a
look at the build setup and its dependency management.
Since I am a lot more comfortable using sbt than maven, I decided to try to
port the maven configuration to sbt (with the help of automated tools).
This led me to a couple of observations and questions on the build system
design:

First, currently, there are two build systems, maven and sbt. Is there a
preferred tool (or future direction to one)?

Second, the sbt build also uses maven "profiles" requiring the use of
specific commandline parameters when starting sbt. Furthermore, since it
relies on maven poms, dependencies to the scala binary version (_2.xx) are
hardcoded and require running an external script when switching versions.
Sbt could leverage built-in constructs to support cross-compilation and
emulate profiles with configurations and new build targets. This would
remove external state from the build (in that no extra steps need to be
performed in a particular order to generate artifacts for a new
configuration) and therefore improve stability and build reproducibility
(maybe even build performance). I was wondering if implementing such
functionality for the sbt build would be welcome?
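
For concreteness, the built-in constructs meant here would look roughly like the
following (an illustrative sbt sketch only, not the actual Spark build definition;
the versions are placeholders):

// build.sbt: cross-building instead of rewriting poms with dev/change-scala-version.sh
crossScalaVersions := Seq("2.10.5", "2.11.7")

// emulating a maven-style profile with an ordinary setting
val hadoopVersion = settingKey[String]("Hadoop version to build against")
hadoopVersion := sys.props.getOrElse("hadoop.version", "2.6.0")
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion.value % "provided"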

thanks,
--Jakob


Need advice on hooking into Sql query plan

2015-11-05 Thread Yana Kadiyska
Hi folks, not sure if this belongs to dev or user list..sending to dev as
it seems a bit convoluted.

I have a UI in which we allow users to write ad-hoc queries against a (very
large, partitioned) table. I would like to analyze the queries prior to
execution for two purposes:

1. Reject under-constrained queries (i.e. there is a field predicate that I
want to make sure is always present)
2. Augment the query with additional predicates (e.g if the user asks for a
student_id I also want to push a constraint on another field)

I could parse the sql string before passing to spark but obviously spark
already does this anyway. Can someone give me general direction on how to
do this (if possible).

Something like

myDF = sql("user_sql_query")
myDF.queryExecution.logical  //here examine the filters provided by user,
reject if underconstrained, push new filters as needed (via
withNewChildren?)

at this point with some luck I'd have a new LogicalPlan -- what is the
proper way to create an execution plan on top of this new Plan? I'm looking
at this
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
but this method is restricted to the package. I'd really prefer to hook
into this as early as possible and still let spark run the plan
optimizations as usual.

Any guidance or pointers much appreciated.


Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Yana Kadiyska
I don't think a view would help -- in the case of under-constraining, I
want to make sure that the user is constraining a column (e.g. I want to
restrict them to querying a single partition at a time but I don't care
which one)...a view per partition value is not practical due to the fairly
high cardinality...

In the case of predicate augmentation, the additional predicate depends on
the value the user is providing e.g. my data is partitioned under
teacherName but the end users don't have this information...So if they ask
for student_id="1234" I'd like to add "teacherName='Smith'" based on a
mapping that is not surfaced to the user (sorry for the contrived
example)...But I don't think I can do this with a view. A join will produce
the right answer but is counter-productive as my goal is to minimize the
partitions being processed.

I can parse the query myself -- I was not fond of this solution as I'd go
sql string to parse tree back to augmented sql string only to have spark
repeat the first part of the exercise... but will do if need be. And yes,
I'd have to be able to process sub-queries too...

On Thu, Nov 5, 2015 at 5:50 PM, Jörn Franke  wrote:

> Would it be possible to use views to address some of your requirements?
>
> Alternatively it might be better to parse it yourself. There are open
> source libraries for that, if you really need a complete SQL parser. Do you
> want to do it on subqueries?
>
> On 05 Nov 2015, at 23:34, Yana Kadiyska  wrote:
>
> Hi folks, not sure if this belongs to dev or user list..sending to dev as
> it seems a bit convoluted.
>
> I have a UI in which we allow users to write ad-hoc queries against a
> (very large, partitioned) table. I would like to analyze the queries prior
> to execution for two purposes:
>
> 1. Reject under-constrained queries (i.e. there is a field predicate that
> I want to make sure is always present)
> 2. Augment the query with additional predicates (e.g if the user asks for
> a student_id I also want to push a constraint on another field)
>
> I could parse the sql string before passing to spark but obviously spark
> already does this anyway. Can someone give me general direction on how to
> do this (if possible).
>
> Something like
>
> myDF = sql("user_sql_query")
> myDF.queryExecution.logical  //here examine the filters provided by user,
> reject if underconstrained, push new filters as needed (via
> withNewChildren?)
>
> at this point with some luck I'd have a new LogicalPlan -- what is the
> proper way to create an execution plan on top of this new Plan? Im looking
> at this
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
> but this method is restricted to the package. I'd really prefer to hook
> into this as early as possible and still let spark run the plan
> optimizations as usual.
>
> Any guidance or pointers much appreciated.
>
>


Re: State of the Build

2015-11-05 Thread Stephen Boesch
Yes. The current dev/change-scala-version.sh mutates (/pollutes) the build
environment by updating the pom.xml in each of the subprojects. If you were
able to come up with a structure that avoids that approach it would be an
improvement.

2015-11-05 15:38 GMT-08:00 Jakob Odersky :

> Hi everyone,
> in the process of learning Spark, I wanted to get an overview of the
> interaction between all of its sub-projects. I therefore decided to have a
> look at the build setup and its dependency management.
> Since I am a lot more comfortable using sbt than maven, I decided to try to
> port the maven configuration to sbt (with the help of automated tools).
> This led me to a couple of observations and questions on the build system
> design:
>
> First, currently, there are two build systems, maven and sbt. Is there a
> preferred tool (or future direction to one)?
>
> Second, the sbt build also uses maven "profiles" requiring the use of
> specific commandline parameters when starting sbt. Furthermore, since it
> relies on maven poms, dependencies to the scala binary version (_2.xx) are
> hardcoded and require running an external script when switching versions.
> Sbt could leverage built-in constructs to support cross-compilation and
> emulate profiles with configurations and new build targets. This would
> remove external state from the build (in that no extra steps need to be
> performed in a particular order to generate artifacts for a new
> configuration) and therefore improve stability and build reproducibility
> (maybe even build performance). I was wondering if implementing such
> functionality for the sbt build would be welcome?
>
> thanks,
> --Jakob
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Christian
Spark 1.5.1-hadoop1

On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> > I am using both 1.4.1 and 1.5.1.
>
> That's the Spark version. I'm wondering what version of Hadoop your Spark
> is built against.
>
> For example, when you download Spark
>  you have to select from a number
> of packages (under "Choose a package type"), and each is built against a
> different version of Hadoop. When Spark is built against Hadoop 2.6+, from
> my understanding, you need to install additional libraries
>  to access S3. When
> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>
> I'm confirming that this is what is happening in your case.
>
> Nick
>
> On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:
>
>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of the
>> new feature for instance-profile which greatly helps with this as well.
>> Without the instance-profile, we got it working by copying a
>> .aws/credentials file up to each node. We could easily automate that
>> through the templates.
>>
>> I don't need any additional libraries. We just need to change the
>> core-site.xml
>>
>> -Christian
>>
>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for sharing this, Christian.
>>>
>>> What build of Spark are you using? If I understand correctly, if you are
>>> using Spark built against Hadoop 2.6+ then additional configs alone won't
>>> help because additional libraries also need to be installed
>>> .
>>>
>>> Nick
>>>
>>> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>>>
 We ended up reading and writing to S3 a ton in our Spark jobs.
 For this to work, we ended up having to add s3a, and s3 key/secret
 pairs. We also had to add fs.hdfs.impl to get these things to work.

 I thought maybe I'd share what we did and it might be worth adding
 these to the spark conf for out of the box functionality with S3.

 We created:

 ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml

 We changed the contents form the original, adding in the following:

   
 fs.file.impl
 org.apache.hadoop.fs.LocalFileSystem
   

   
 fs.hdfs.impl
 org.apache.hadoop.hdfs.DistributedFileSystem
   

   
 fs.s3.impl
 org.apache.hadoop.fs.s3native.NativeS3FileSystem
   

   
 fs.s3.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3.awsSecretAccessKey
 {{aws_secret_access_key}}
   

   
 fs.s3n.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3n.awsSecretAccessKey
 {{aws_secret_access_key}}
   

   
 fs.s3a.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3a.awsSecretAccessKey
 {{aws_secret_access_key}}
   

 This change makes spark on ec2 work out of the box for us. It took us
 several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
 version 2.

 Best Regards,
 Christian

>>>
>>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Christian
Even with the changes I mentioned above?
On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas 
wrote:

> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
> cannot access S3, unfortunately.
>
> On Thu, Nov 5, 2015 at 3:53 PM Christian  wrote:
>
>> I created the cluster with the following:
>>
>> --hadoop-major-version=2
>> --spark-version=1.4.1
>>
>> from: spark-1.5.1-bin-hadoop1
>>
>> Are you saying there might be different behavior if I download
>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>
>> On Thu, Nov 5, 2015 at 1:28 PM, Christian  wrote:
>>
>>> Spark 1.5.1-hadoop1
>>>
>>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 > I am using both 1.4.1 and 1.5.1.

 That's the Spark version. I'm wondering what version of Hadoop your
 Spark is built against.

 For example, when you download Spark
  you have to select from a
 number of packages (under "Choose a package type"), and each is built
 against a different version of Hadoop. When Spark is built against Hadoop
 2.6+, from my understanding, you need to install additional libraries
  to access S3. When
 Spark is built against Hadoop 2.4 or earlier, you don't need to do this.

 I'm confirming that this is what is happening in your case.

 Nick

 On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:

> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
> the new feature for instance-profile which greatly helps with this as 
> well.
> Without the instance-profile, we got it working by copying a
> .aws/credentials file up to each node. We could easily automate that
> through the templates.
>
> I don't need any additional libraries. We just need to change the
> core-site.xml
>
> -Christian
>
> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for sharing this, Christian.
>>
>> What build of Spark are you using? If I understand correctly, if you
>> are using Spark built against Hadoop 2.6+ then additional configs alone
>> won't help because additional libraries also need to be installed
>> .
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>>
>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>> For this to work, we ended up having to add s3a, and s3 key/secret
>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>
>>> I thought maybe I'd share what we did and it might be worth adding
>>> these to the spark conf for out of the box functionality with S3.
>>>
>>> We created:
>>>
>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>
>>> We changed the contents form the original, adding in the following:
>>>
>>>   
>>> fs.file.impl
>>> org.apache.hadoop.fs.LocalFileSystem
>>>   
>>>
>>>   
>>> fs.hdfs.impl
>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>   
>>>
>>>   
>>> fs.s3.impl
>>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>>   
>>>
>>>   
>>> fs.s3.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3n.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3n.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3a.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3a.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>> This change makes spark on ec2 work out of the box for us. It took
>>> us several days to figure this out. It works for 1.4.1 and 1.5.1 on 
>>> Hadoop
>>> version 2.
>>>
>>> Best Regards,
>>> Christian
>>>
>>
>
>>>
>>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Christian
Oh right. I forgot about the libraries being removed.
On Thu, Nov 5, 2015 at 10:35 PM Nicholas Chammas 
wrote:

> I might be mistaken, but yes, even with the changes you mentioned you will
> not be able to access S3 if Spark is built against Hadoop 2.6+ unless you
> install additional libraries. The issue is explained in SPARK-7481
>  and SPARK-7442
> .
>
> On Fri, Nov 6, 2015 at 12:22 AM Christian  wrote:
>
>> Even with the changes I mentioned above?
>> On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
>>> cannot access S3, unfortunately.
>>>
>>> On Thu, Nov 5, 2015 at 3:53 PM Christian  wrote:
>>>
 I created the cluster with the following:

 --hadoop-major-version=2
 --spark-version=1.4.1

 from: spark-1.5.1-bin-hadoop1

 Are you saying there might be different behavior if I download
 spark-1.5.1-hadoop-2.6 and create my cluster?

 On Thu, Nov 5, 2015 at 1:28 PM, Christian  wrote:

> Spark 1.5.1-hadoop1
>
> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> > I am using both 1.4.1 and 1.5.1.
>>
>> That's the Spark version. I'm wondering what version of Hadoop your
>> Spark is built against.
>>
>> For example, when you download Spark
>>  you have to select from a
>> number of packages (under "Choose a package type"), and each is built
>> against a different version of Hadoop. When Spark is built against Hadoop
>> 2.6+, from my understanding, you need to install additional libraries
>>  to access S3.
>> When Spark is built against Hadoop 2.4 or earlier, you don't need to do
>> this.
>>
>> I'm confirming that this is what is happening in your case.
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:
>>
>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because
>>> of the new feature for instance-profile which greatly helps with this as
>>> well.
>>> Without the instance-profile, we got it working by copying a
>>> .aws/credentials file up to each node. We could easily automate that
>>> through the templates.
>>>
>>> I don't need any additional libraries. We just need to change the
>>> core-site.xml
>>>
>>> -Christian
>>>
>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Thanks for sharing this, Christian.

 What build of Spark are you using? If I understand correctly, if
 you are using Spark built against Hadoop 2.6+ then additional configs 
 alone
 won't help because additional libraries also need to be installed
 .

 Nick

 On Thu, Nov 5, 2015 at 11:25 AM Christian 
 wrote:

> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret
> pairs. We also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding
> these to the spark conf for out of the box functionality with S3.
>
> We created:
>
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents form the original, adding in the following:
>
>   
> fs.file.impl
> org.apache.hadoop.fs.LocalFileSystem
>   
>
>   
> fs.hdfs.impl
> org.apache.hadoop.hdfs.DistributedFileSystem
>   
>
>   
> fs.s3.impl
> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>   
>
>   
> fs.s3.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3n.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3n.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
>   
> fs.s3a.awsAccessKeyId
> {{aws_access_key_id}}
>   
>
>   
> fs.s3a.awsSecretAccessKey
> {{aws_secret_access_key}}
>   
>
> This change 

Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
I might be mistaken, but yes, even with the changes you mentioned you will
not be able to access S3 if Spark is built against Hadoop 2.6+ unless you
install additional libraries. The issue is explained in SPARK-7481
 and SPARK-7442
.

On Fri, Nov 6, 2015 at 12:22 AM Christian  wrote:

> Even with the changes I mentioned above?
> On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
>> cannot access S3, unfortunately.
>>
>> On Thu, Nov 5, 2015 at 3:53 PM Christian  wrote:
>>
>>> I created the cluster with the following:
>>>
>>> --hadoop-major-version=2
>>> --spark-version=1.4.1
>>>
>>> from: spark-1.5.1-bin-hadoop1
>>>
>>> Are you saying there might be different behavior if I download
>>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>>
>>> On Thu, Nov 5, 2015 at 1:28 PM, Christian  wrote:
>>>
 Spark 1.5.1-hadoop1

 On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> > I am using both 1.4.1 and 1.5.1.
>
> That's the Spark version. I'm wondering what version of Hadoop your
> Spark is built against.
>
> For example, when you download Spark
>  you have to select from a
> number of packages (under "Choose a package type"), and each is built
> against a different version of Hadoop. When Spark is built against Hadoop
> 2.6+, from my understanding, you need to install additional libraries
>  to access S3. When
> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>
> I'm confirming that this is what is happening in your case.
>
> Nick
>
> On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:
>
>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>> the new feature for instance-profile which greatly helps with this as 
>> well.
>> Without the instance-profile, we got it working by copying a
>> .aws/credentials file up to each node. We could easily automate that
>> through the templates.
>>
>> I don't need any additional libraries. We just need to change the
>> core-site.xml
>>
>> -Christian
>>
>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for sharing this, Christian.
>>>
>>> What build of Spark are you using? If I understand correctly, if you
>>> are using Spark built against Hadoop 2.6+ then additional configs alone
>>> won't help because additional libraries also need to be installed
>>> .
>>>
>>> Nick
>>>
>>> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>>>
 We ended up reading and writing to S3 a ton in our Spark jobs.
 For this to work, we ended up having to add s3a, and s3 key/secret
 pairs. We also had to add fs.hdfs.impl to get these things to work.

 I thought maybe I'd share what we did and it might be worth adding
 these to the spark conf for out of the box functionality with S3.

 We created:

 ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml

 We changed the contents form the original, adding in the following:

   
 fs.file.impl
 org.apache.hadoop.fs.LocalFileSystem
   

   
 fs.hdfs.impl
 org.apache.hadoop.hdfs.DistributedFileSystem
   

   
 fs.s3.impl
 org.apache.hadoop.fs.s3native.NativeS3FileSystem
   

   
 fs.s3.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3.awsSecretAccessKey
 {{aws_secret_access_key}}
   

   
 fs.s3n.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3n.awsSecretAccessKey
 {{aws_secret_access_key}}
   

   
 fs.s3a.awsAccessKeyId
 {{aws_access_key_id}}
   

   
 fs.s3a.awsSecretAccessKey
 {{aws_secret_access_key}}
   

 This change makes spark on ec2 work out of the box for us. It took
 us several days to figure this out. It works for 1.4.1 and 1.5.1 on 
 Hadoop
 version 2.

 Best Regards,
 Christian

>>>
>>

>>>


Re: State of the Build

2015-11-05 Thread Sean Owen
Maven isn't 'legacy', or supported for the benefit of third parties.
SBT had some behaviors / problems that Maven didn't relative to what
Spark needs. SBT is a development-time alternative only, and partly
generated from the Maven build.

On Fri, Nov 6, 2015 at 1:48 AM, Koert Kuipers  wrote:
> People who do upstream builds of spark (think bigtop and hadoop distros) are
> used to legacy systems like maven, so maven is the default build. I don't
> think it will change.
>
> Any improvements for the sbt build are of course welcome (it is still used
> by many developers), but i would not do anything that increases the burden
> of maintaining two build systems.
>
> On Nov 5, 2015 18:38, "Jakob Odersky"  wrote:
>>
>> Hi everyone,
>> in the process of learning Spark, I wanted to get an overview of the
>> interaction between all of its sub-projects. I therefore decided to have a
>> look at the build setup and its dependency management.
>> Since I am a lot more comfortable using sbt than maven, I decided to try to
>> port the maven configuration to sbt (with the help of automated tools).
>> This led me to a couple of observations and questions on the build system
>> design:
>>
>> First, currently, there are two build systems, maven and sbt. Is there a
>> preferred tool (or future direction to one)?
>>
>> Second, the sbt build also uses maven "profiles" requiring the use of
>> specific commandline parameters when starting sbt. Furthermore, since it
>> relies on maven poms, dependencies to the scala binary version (_2.xx) are
>> hardcoded and require running an external script when switching versions.
>> Sbt could leverage built-in constructs to support cross-compilation and
>> emulate profiles with configurations and new build targets. This would
>> remove external state from the build (in that no extra steps need to be
>> performed in a particular order to generate artifacts for a new
>> configuration) and therefore improve stability and build reproducibility
>> (maybe even build performance). I was wondering if implementing such
>> functionality for the sbt build would be welcome?
>>
>> thanks,
>> --Jakob

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: State of the Build

2015-11-05 Thread Koert Kuipers
People who do upstream builds of spark (think bigtop and hadoop distros)
are used to legacy systems like maven, so maven is the default build. I
don't think it will change.

Any improvements for the sbt build are of course welcome (it is still used
by many developers), but I would not do anything that increases the burden
of maintaining two build systems.
On Nov 5, 2015 18:38, "Jakob Odersky"  wrote:

> Hi everyone,
> in the process of learning Spark, I wanted to get an overview of the
> interaction between all of its sub-projects. I therefore decided to have a
> look at the build setup and its dependency management.
> Since I am a lot more comfortable using sbt than maven, I decided to try to
> port the maven configuration to sbt (with the help of automated tools).
> This led me to a couple of observations and questions on the build system
> design:
>
> First, currently, there are two build systems, maven and sbt. Is there a
> preferred tool (or future direction to one)?
>
> Second, the sbt build also uses maven "profiles" requiring the use of
> specific commandline parameters when starting sbt. Furthermore, since it
> relies on maven poms, dependencies to the scala binary version (_2.xx) are
> hardcoded and require running an external script when switching versions.
> Sbt could leverage built-in constructs to support cross-compilation and
> emulate profiles with configurations and new build targets. This would
> remove external state from the build (in that no extra steps need to be
> performed in a particular order to generate artifacts for a new
> configuration) and therefore improve stability and build reproducibility
> (maybe even build performance). I was wondering if implementing such
> functionality for the sbt build would be welcome?
>
> thanks,
> --Jakob
>


Re: Master build fails ?

2015-11-05 Thread Marcelo Vanzin
FYI I pushed a fix for this to github; so if you pull everything
should work now.
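
(For reference, the general shape of such an exclusion in sbt terms; purely
illustrative, since the actual fix is the commit mentioned above and the module
shown here is just an example:)

// Keep curator from dragging its own (newer) guava onto the classpath.
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.4.0" exclude("com.google.guava", "guava")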

On Thu, Nov 5, 2015 at 12:07 PM, Marcelo Vanzin  wrote:
> Man that command is slow. Anyway, it seems guava 16 is being brought
> transitively by curator 2.6.0 which should have been overridden by the
> explicit dependency on curator 2.4.0, but apparently, as Steve
> mentioned, sbt/ivy decided to break things, so I'll be adding some
> exclusions.
>
> On Thu, Nov 5, 2015 at 11:55 AM, Marcelo Vanzin  wrote:
>> Answering my own question: "dependency-graph"
>>
>> On Thu, Nov 5, 2015 at 11:44 AM, Marcelo Vanzin  wrote:
>>> Does anyone know how to get something similar to "mvn dependency:tree" from 
>>> sbt?
>>>
>>> mvn dependency:tree with hadoop 2.6.0 does not show any instances of guava 
>>> 16...
>>>
>>> On Thu, Nov 5, 2015 at 11:37 AM, Ted Yu  wrote:
 build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
 -Dhadoop.version=2.6.0 -DskipTests assembly

 The above command fails on Mac.

 build/sbt -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -Pkinesis-asl
 -DskipTests assembly

 The above command, used by Jenkins, passes.
 That's why the build error wasn't caught.

 FYI

 On Thu, Nov 5, 2015 at 11:07 AM, Dilip Biswal  wrote:
>
> Hello Ted,
>
> Thanks for your response.
>
> Here is the command i used :
>
> build/sbt clean
> build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver
> -Dhadoop.version=2.6.0 -DskipTests assembly
>
> I am building on CentOS and on master branch.
>
> One other thing: I was able to build fine with the above command until
> recently. I think I started to have problems after SPARK-11073, where the
> HashCodes import was added.
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:Ted Yu 
> To:Dilip Biswal/Oakland/IBM@IBMUS
> Cc:Jean-Baptiste Onofré , 
> "dev@spark.apache.org"
> 
> Date:11/05/2015 10:46 AM
> Subject:Re: Master build fails ?
> 
>
>
>
> Dilip:
> Can you give the command you used ?
>
> Which release were you building ?
> What OS did you build on ?
>
> Cheers
>
> On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
> Hello,
>
> I am getting the same build error about not being able to find
> com.google.common.hash.HashCodes.
>
>
> Is there a solution to this ?
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:Jean-Baptiste Onofré 
> To:Ted Yu 
> Cc:"dev@spark.apache.org" 
> Date:11/03/2015 07:20 AM
> Subject:Re: Master build fails ?
> 
>
>
>
> Hi Ted,
>
> thanks for the update. The build with sbt is in progress on my box.
>
> Regards
> JB
>
> On 11/03/2015 03:31 PM, Ted Yu wrote:
> > Interesting, Sbt builds were not all failing:
> >
> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
> >
> > FYI
> >
> > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  > > wrote:
>
> >
> > Hi Jacek,
> >
> > it works fine with mvn: the problem is with sbt.
> >
> > I suspect a different reactor order in sbt compared to mvn.
> >
> > Regards
> > JB
> >
> > On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
> >
> > Hi,
> >
> > Just built the sources using the following command and it worked
> > fine.
> >
> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> > -DskipTests clean install
> > ...
> > [INFO]
> >
> > 
> > [INFO] BUILD SUCCESS
> > [INFO]
> >
> > 
> > [INFO] Total time: 14:15 min
> > [INFO] Finished at: 2015-11-03T14:40:40+01:00
> > [INFO] Final Memory: 438M/1972M
> > [INFO]
> >
> > 
> >
> > ➜  spark git:(master) ✗ java -version
> > java version "1.8.0_66"
> > Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> 

RE: dataframe slow down with tungsten turn on

2015-11-05 Thread Cheng, Hao
What's the size of the raw data and the result data? Are there any other
changes besides the Spark binary, such as HDFS, Spark configuration, or your own
code? Can you monitor the IO/CPU state while executing the final stage? It would
also be great if you could paste the call stack if you observe high CPU
utilization.

And can you try not to cache anything and repeat the same step? Just be sure 
it’s not caused by the memory stuff.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Friday, November 6, 2015 12:18 AM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on


-- Forwarded message --
From: gen tang >
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" >

Hi,

My application is as follows:
1. create dataframe from hive table
2. transform dataframe to rdd of json and do some aggregations on json (in 
fact, I use pyspark, so it is rdd of dict)
3. retransform rdd of json to dataframe and cache it (triggered by count)
4. join several dataframe which is created by the above steps.
5. save final dataframe as json.(by dataframe write api)

There are a lot of stages; the other stages are quite the same under the two
versions of Spark. However, the final step (save as json) is 1 min vs. 2 hours.
I think it is the write to HDFS that causes the slowness of the final stage,
but I don't know why...

In fact, I made a mistake about the version of Spark that I used. The Spark
that runs faster is built from the source code of Spark 1.4.1. The Spark that
runs slower is built from the source code of Spark 1.5.2, as of 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao 
> wrote:
BTW, 1 min V.S. 2 Hours, seems quite weird, can you provide more information on 
the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org
Subject: RE: dataframe slow down with tungsten turn on

1.5.0 has critical performance/bug issues; you’d better try 1.5.1 or a 1.5.2 RC 
version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on Spark 1.5 with Tungsten turned off. The 
result is about the same as with Tungsten turned on.
It seems that it is not a Tungsten problem; it is simply that Spark 1.5 is 
slower than Spark 1.4.
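
For reference, a minimal sketch of one way to turn Tungsten off in the 1.5 
line, assuming the spark.sql.tungsten.enabled switch (which defaults to true in 
1.5 and is not present in all releases):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="tungsten-off-test")
    sqlContext = HiveContext(sc)

    # Disable the Tungsten code path for this SQLContext (Spark 1.5.x switch;
    # it defaults to "true" when unset).
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")

The same switch can also be passed to spark-submit or the pyspark launcher as 
--conf spark.sql.tungsten.enabled=false.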

Is there any idea about why it happens?
Thanks a lot in advance

Cheers
Gen


-- Forwarded message --
From: gen tang
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org" 
>
Hi sparkers,

I am using DataFrames to do some large ETL jobs.
More precisely, I create a dataframe from a Hive table, do some operations, and 
then save it as json.

When I used spark-1.4.1, the whole process was quite fast, about 1 min. 
However, when I use the same code with spark-1.5.1 (with Tungsten turned on), it 
takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen






Re: State of the Build

2015-11-05 Thread Mark Hamstra
There was a lot of discussion that preceded our arriving at this statement
in the Spark documentation: "Maven is the official build tool recommended
for packaging Spark, and is the build of reference."
https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt

I'm not aware of anything new, either in SBT tooling or in your post, Jakob,
that would lead us to reconsider the choice of Maven over SBT for
the reference build of Spark.  Of course, I'm by no means the sole and
final authority on the matter, but at least I am not seeing anything in
your suggested approach that hasn't already been considered.  You're
welcome to review the prior discussion and try to convince us that we've
made the wrong choice, but I wouldn't expect that to be a quick and easy
process.


On Thu, Nov 5, 2015 at 4:44 PM, Ted Yu  wrote:

> See previous discussion:
> http://search-hadoop.com/m/q3RTtPnPnzwOhBr
>
> FYI
>
> On Thu, Nov 5, 2015 at 4:30 PM, Stephen Boesch  wrote:
>
>> Yes. The current dev/change-scala-version.sh mutates (/pollutes) the
>> build environment by updating the pom.xml in each of the subprojects. If
>> you were able to come up with a structure that avoids that approach it
>> would be an improvement.
>>
>> 2015-11-05 15:38 GMT-08:00 Jakob Odersky :
>>
>>> Hi everyone,
>>> in the process of learning Spark, I wanted to get an overview of the
>>> interaction between all of its sub-projects. I therefore decided to have a
>>> look at the build setup and its dependency management.
>>> Since I am a lot more comfortable using sbt than maven, I decided to try
>>> to port the maven configuration to sbt (with the help of automated tools).
>>> This led me to a couple of observations and questions on the build
>>> system design:
>>>
>>> First, currently, there are two build systems, maven and sbt. Is there a
>>> preferred tool (or future direction to one)?
>>>
>>> Second, the sbt build also uses maven "profiles" requiring the use of
>>> specific commandline parameters when starting sbt. Furthermore, since it
>>> relies on maven poms, dependencies to the scala binary version (_2.xx) are
>>> hardcoded and require running an external script when switching versions.
>>> Sbt could leverage built-in constructs to support cross-compilation and
>>> emulate profiles with configurations and new build targets. This would
>>> remove external state from the build (in that no extra steps need to be
>>> performed in a particular order to generate artifacts for a new
>>> configuration) and therefore improve stability and build reproducibility
>>> (maybe even build performance). I was wondering if implementing such
>>> functionality for the sbt build would be welcome?
>>>
>>> thanks,
>>> --Jakob
>>>
>>
>>
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
cannot access S3, unfortunately.
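
If it helps anyone hitting this, one common way to pull the extra S3 libraries 
into a PySpark session on a Hadoop 2.6 build is the --packages mechanism; the 
sketch below sets it through PYSPARK_SUBMIT_ARGS when launching from a plain 
Python interpreter. The coordinates and versions are illustrative and need to 
match the Hadoop build, and the bucket path is a placeholder.

    import os
    from pyspark import SparkContext

    # Ask the launcher to fetch the S3 connector jars before the JVM starts.
    # Credentials still need to come from core-site.xml, the environment, or
    # an instance profile.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages org.apache.hadoop:hadoop-aws:2.6.0,"
        "com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell"
    )

    sc = SparkContext(appName="s3a-check")
    rdd = sc.textFile("s3a://some-bucket/some-prefix/")  # placeholder path
    print(rdd.count())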

On Thu, Nov 5, 2015 at 3:53 PM Christian  wrote:

> I created the cluster with the following:
>
> --hadoop-major-version=2
> --spark-version=1.4.1
>
> from: spark-1.5.1-bin-hadoop1
>
> Are you saying there might be different behavior if I download
> spark-1.5.1-hadoop-2.6 and create my cluster?
>
> On Thu, Nov 5, 2015 at 1:28 PM, Christian  wrote:
>
>> Spark 1.5.1-hadoop1
>>
>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> > I am using both 1.4.1 and 1.5.1.
>>>
>>> That's the Spark version. I'm wondering what version of Hadoop your
>>> Spark is built against.
>>>
>>> For example, when you download Spark
>>>  you have to select from a
>>> number of packages (under "Choose a package type"), and each is built
>>> against a different version of Hadoop. When Spark is built against Hadoop
>>> 2.6+, from my understanding, you need to install additional libraries
>>>  to access S3. When
>>> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>>>
>>> I'm confirming that this is what is happening in your case.
>>>
>>> Nick
>>>
>>> On Thu, Nov 5, 2015 at 12:17 PM Christian  wrote:
>>>
 I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
 the new instance-profile feature, which greatly helps with this as well.
 Without the instance-profile, we got it working by copying a
 .aws/credentials file up to each node. We could easily automate that
 through the templates.

 I don't need any additional libraries. We just need to change the
 core-site.xml

 -Christian

 On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Thanks for sharing this, Christian.
>
> What build of Spark are you using? If I understand correctly, if you
> are using Spark built against Hadoop 2.6+ then additional configs alone
>>> won't help because additional libraries also need to be installed.
>
> Nick
>
> On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:
>
>> We ended up reading and writing to S3 a ton in our Spark jobs.
>> For this to work, we had to add s3a and s3 key/secret pairs. We also had to
>> add fs.hdfs.impl to get these things to work.
>>
>> I thought maybe I'd share what we did and it might be worth adding
>> these to the spark conf for out of the box functionality with S3.
>>
>> We created:
>>
>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>
>> We changed the contents from the original, adding in the following:
>>
>>   <property>
>>     <name>fs.file.impl</name>
>>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.hdfs.impl</name>
>>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3.impl</name>
>>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3.awsAccessKeyId</name>
>>     <value>{{aws_access_key_id}}</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3.awsSecretAccessKey</name>
>>     <value>{{aws_secret_access_key}}</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3n.awsAccessKeyId</name>
>>     <value>{{aws_access_key_id}}</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3n.awsSecretAccessKey</name>
>>     <value>{{aws_secret_access_key}}</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3a.awsAccessKeyId</name>
>>     <value>{{aws_access_key_id}}</value>
>>   </property>
>>
>>   <property>
>>     <name>fs.s3a.awsSecretAccessKey</name>
>>     <value>{{aws_secret_access_key}}</value>
>>   </property>
>>
>> This change makes spark on ec2 work out of the box for us. It took us
>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>> version 2.
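
As a side note, the same key/secret properties can also be set at runtime on 
the SparkContext's underlying Hadoop configuration instead of templating 
core-site.xml. A minimal PySpark sketch is below; the credentials and path are 
placeholders, and sc._jsc.hadoopConfiguration() is an internal handle rather 
than a public API, so treat it as a convenience hack.

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-conf-sketch")

    # Internal handle to the Hadoop Configuration backing this context.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")          # placeholder
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")  # placeholder

    rdd = sc.textFile("s3n://some-bucket/some-prefix/")  # placeholder path
    print(rdd.count())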
>>
>> Best Regards,
>> Christian
>>
>

>>
>


Re: State of the Build

2015-11-05 Thread Patrick Wendell
Hey Jakob,

The builds in Spark are largely maintained by me, Sean, and Michael
Armbrust (for SBT). For historical reasons, Spark supports both a Maven and
SBT build. Maven is the build of reference for packaging Spark and is used
by many downstream packagers and to build all Spark releases. SBT is more
often used by developers. Both builds inherit from the same pom files (and
rely on the same profiles) to minimize maintenance complexity of Spark's
very complex dependency graph.

If you are looking to make contributions that help with the build, I am
happy to point you towards some things that are consistent maintenance
headaches. There are two major pain points right now that I'd be thrilled
to see fixes for:

1. SBT relies on a different dependency conflict resolution strategy than
maven - causing all kinds of headaches for us. I have heard that newer
versions of SBT can (maybe?) use Maven as a dependency resolver instead of
Ivy. This would make our life so much better if it were possible, either by
virtue of upgrading SBT or somehow doing this ourselves.

2. We don't have a great way of auditing the net effect of dependency
changes when people make them in the build. I am working on a fairly clunky
patch to do this here:

https://github.com/apache/spark/pull/8531

It could be done much more nicely using SBT, but only provided (1) is
solved.

Doing a major overhaul of the sbt build to decouple it from pom files, I'm
not sure that's the best place to start, given that we need to continue to
support maven - the coupling is intentional. But getting involved in the
build in general would be completely welcome.

- Patrick

On Thu, Nov 5, 2015 at 10:53 PM, Sean Owen  wrote:

> Maven isn't 'legacy', or supported for the benefit of third parties.
> SBT had some behaviors / problems that Maven didn't relative to what
> Spark needs. SBT is a development-time alternative only, and partly
> generated from the Maven build.
>
> On Fri, Nov 6, 2015 at 1:48 AM, Koert Kuipers  wrote:
> > People who do upstream builds of spark (think bigtop and hadoop distros)
> are
> > used to legacy systems like maven, so maven is the default build. I don't
> > think it will change.
> >
> > Any improvements for the sbt build are of course welcome (it is still
> used
> > by many developers), but I would not do anything that increases the
> burden
> > of maintaining two build systems.
> >
> > On Nov 5, 2015 18:38, "Jakob Odersky"  wrote:
> >>
> >> Hi everyone,
> >> in the process of learning Spark, I wanted to get an overview of the
> >> interaction between all of its sub-projects. I therefore decided to
> have a
> >> look at the build setup and its dependency management.
> >> Since I am a lot more comfortable using sbt than maven, I decided to try
> to
> >> port the maven configuration to sbt (with the help of automated tools).
> >> This led me to a couple of observations and questions on the build
> system
> >> design:
> >>
> >> First, currently, there are two build systems, maven and sbt. Is there a
> >> preferred tool (or future direction to one)?
> >>
> >> Second, the sbt build also uses maven "profiles" requiring the use of
> >> specific commandline parameters when starting sbt. Furthermore, since it
> >> relies on maven poms, dependencies to the scala binary version (_2.xx)
> are
> >> hardcoded and require running an external script when switching
> versions.
> >> Sbt could leverage built-in constructs to support cross-compilation and
> >> emulate profiles with configurations and new build targets. This would
> >> remove external state from the build (in that no extra steps need to be
> >> performed in a particular order to generate artifacts for a new
> >> configuration) and therefore improve stability and build reproducibility
> >> (maybe even build performance). I was wondering if implementing such
> >> functionality for the sbt build would be welcome?
> >>
> >> thanks,
> >> --Jakob
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>