Re: Mathematical functions in spark sql

2015-01-26 Thread Alexey Romanchuk
I have tried "select ceil(2/3)", but got "key not found: floor"
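If an integer result is all that is needed, an explicit cast is worth trying. A sketch of variants, assuming the HiveQL-compatible dialect (HiveContext / Thrift server); the "key not found: floor" error may simply mean the query went through the plain SQLContext parser, which did not register Hive's math UDFs:

  select cast(2/3 as int);       -- truncates the double result to an integer (0)
  select cast(2 as double) / 3;  -- keeps the fractional result explicitly
  select floor(2/3), ceil(2/3);  -- hive math udfs, should work in the hive dialect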

On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu  wrote:

> Have you tried floor() or ceil() functions ?
>
> According to http://spark.apache.org/sql/, Spark SQL is compatible with
> Hive SQL.
>
> Cheers
>
> On Mon, Jan 26, 2015 at 8:29 PM, 1esha  wrote:
>
>> Hello everyone!
>>
>> I tried to execute "select 2/3" and got "0.". Is there any way
>> to cast the double result to an int, or something similar?
>>
>> Also, it would be nice to get a list of the functions supported by spark sql.
>>
>> Thanks!
>>
>>
>>
>


Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-02 Thread Alexey Romanchuk
Any ideas? Anyone got the same error?

On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk  wrote:

> Hello spark users!
>
> I found lots of strange messages in driver log. Here it is:
>
> 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
> ERROR
> akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
> - AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] <-
> [akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
> akka.tcp://sparkExecutor@data1.hadoop:17372] [
> akka.remote.ShutDownAssociation: Shut down address:
> akka.tcp://sparkExecutor@data1.hadoop:17372
> Caused by: akka.remote.transport.Transport$InvalidAssociationException:
> The remote system terminated the association because it is shutting down.
> ]
>
> I get this message twice for every worker: first for driverPropsFetcher
> and then for sparkExecutor. It looks like spark shuts down the remote akka
> system incorrectly, or there is a race condition in this process: the
> driver sends data to the worker while the worker's actor system is already
> shutting down.
>
> Except for this message everything works fine. But it is logged at ERROR
> level, so it shows up in my "ERROR only" log.
>
> Do you have any idea whether this is a configuration issue, a bug in spark
> or akka, or something else?
>
> Thanks!
>
>
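If these messages are indeed just benign shutdown noise, one workaround is to raise the threshold of the logger that emits them so they stay out of an ERROR-only log. A sketch for conf/log4j.properties (assuming the default log4j backend; note that this hides all errors from that particular logger, not only the shutdown ones):

  # demote the akka endpoint-writer association errors seen during executor shutdown
  log4j.logger.akka.remote.EndpointWriter=FATAL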


akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-01 Thread Alexey Romanchuk
Hello spark users!

I found lots of strange messages in driver log. Here it is:

2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
ERROR
akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
- AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] <-
[akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372] [
akka.remote.ShutDownAssociation: Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The
remote system terminated the association because it is shutting down.
]

I get this message twice for every worker: first for driverPropsFetcher and
then for sparkExecutor. It looks like spark shuts down the remote akka system
incorrectly, or there is a race condition in this process: the driver sends
data to the worker while the worker's actor system is already shutting down.

Except for this message everything works fine. But it is logged at ERROR
level, so it shows up in my "ERROR only" log.

Do you have any idea whether this is a configuration issue, a bug in spark or
akka, or something else?

Thanks!


Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hey Sean and spark users!

Thanks for the reply. I tried -Xcomp just now; startup took a few minutes (as
expected), but the first query was as slow as before:
Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 30 columns in 12897 ms:
121.64837 rec/ms, 3649.451 cell/ms

and the next one:

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1757 ms:
892.94196 rec/ms, 892.94196 cell/ms

I doubt it is caching or anything similar, because CPU load on the worker is
100% and jstack shows that the worker is reading from the parquet file.
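For reference, a sketch of how such JVM flags can be passed to the executor JVMs rather than only the driver, assuming the Thrift server is launched with the usual conf/spark-defaults.conf in place (the property is standard; the flag values are just the ones discussed in this thread):

  # conf/spark-defaults.conf
  spark.executor.extraJavaOptions  -Xcomp
  # or, to see what HotSpot compiles on the executors:
  # spark.executor.extraJavaOptions  -XX:+PrintCompilation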

Any ideas?

Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen  wrote:

> You could try setting "-Xcomp" for executors to force JIT compilation
> upfront. I don't know if it's a good idea overall but might show
> whether the upfront compilation really helps. I doubt it.
>
> However, isn't this almost surely due to caching somewhere, in Spark SQL
> or HDFS? I really doubt hotspot makes a difference compared to these
> much larger factors.
>
> On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk
>  wrote:
> > Hello spark users and developers!
> >
> > I am using hdfs + spark sql + hive schema + parquet as the storage format.
> > I have a lot of parquet files; each file fits one hdfs block and holds one
> > day of data. The strange thing is that the first spark sql query is very
> > slow.
> >
> > To reproduce the situation I use only one core: the first query takes
> > 97 sec and every subsequent query only 13 sec. I query different data each
> > time, but it has the same structure and size. The situation can be
> > reproduced by restarting the thrift server.
> >
> > Here is some information about parquet file reads on a worker node:
> >
> > Slow one:
> > Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1560251 records from 30 columns in 11686 ms:
> > 133.51454 rec/ms, 4005.4363 cell/ms
> >
> > Fast one:
> > Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1568899 records from 1 columns in 1373 ms:
> > 1142.6796 rec/ms, 1142.6796 cell/ms
> >
> > As you can see, the second read is about 10x faster than the first. Most
> > of the query time is spent working with the parquet file.
> >
> > This problem is really annoying, because most of my spark jobs contain
> > just one sql query plus data processing, and to speed them up I put a
> > special warm-up query in front of every job.
> >
> > My assumption is that hotspot optimizations kick in during the first
> > read. Do you have any idea how to confirm or solve this performance
> > problem?
> >
> > Thanks for advice!
> >
> > p.s. With -XX:+PrintCompilation I see billions of hotspot compilation
> > entries, but cannot figure out which are important and which are not.
>


Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hello spark users and developers!

I am using hdfs + spark sql + hive schema + parquet as the storage format. I
have a lot of parquet files; each file fits one hdfs block and holds one day
of data. The strange thing is that the first spark sql query is very slow.

To reproduce the situation I use only one core: the first query takes 97 sec
and every subsequent query only 13 sec. I query different data each time, but
it has the same structure and size. The situation can be reproduced by
restarting the thrift server.

Here is some information about parquet file reads on a worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1560251 records from 30 columns in 11686 ms:
133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1373 ms:
1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is about 10x faster than the first. Most of
the query time is spent working with the parquet file.

This problem is really annoying, because most of my spark jobs contain just
one sql query plus data processing, and to speed them up I put a special
warm-up query in front of every job.
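For illustration, a minimal sketch of such a warm-up query, run against the same data before the real job so that the slow first scan is paid up front (the table and partition column names here are hypothetical):

  -- warm-up: touch the data the real query will read (names are hypothetical)
  select count(*) from my_table where day = '2014-10-09';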

My assumption is that hotspot optimizations kick in during the first read. Do
you have any idea how to confirm or solve this performance problem?

Thanks for advice!

p.s. With -XX:+PrintCompilation I see billions of hotspot compilation entries,
but cannot figure out which are important and which are not.


Re: Log hdfs blocks sending

2014-09-26 Thread Alexey Romanchuk
Hello Andrew!

Thanks for the reply. Which logs, and at what level, should I check? Driver,
master, or worker?

I found locality information on the master node, but it shows only the ANY
locality level. Here is the driver (spark sql) log -
https://gist.github.com/13h3r/c91034307caa33139001 - and one of the worker
logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2

Do you have any idea where to look at?

Thanks!
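One possible way to see where each block is read from, sketched as a diagnostic setting for the executors' conf/log4j.properties (assumption: at DEBUG the HDFS client logs which datanode it connects to for each block, so local vs. remote reads can be distinguished by host; it is verbose, so only for short experiments):

  # hypothetical diagnostic setting for the executors' log4j.properties
  log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG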

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash  wrote:

> Hi Alexey,
>
> You should see in the logs a locality measure like NODE_LOCAL,
> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
> on them and you're reading out of HDFS, then you should be seeing almost
> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
> think the data is local and does remote reads which really kills
> performance.
>
> Hope that helps!
> Andrew
>
> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
> alexey.romanc...@gmail.com> wrote:
>
>> Hello again spark users and developers!
>>
>> I have a standalone spark cluster (1.1.0) with spark sql running on it. The
>> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>>
>> I use the thrift server to access spark sql and have 1 table with 30+
>> partitions. When I run a query over the whole table (something simple like
>> select count(*) from t), spark produces a lot of network activity, filling
>> the whole 1gb link. It looks like spark sends the data over the network
>> instead of reading it locally.
>>
>> Is there any way to log which blocks were accessed locally and which were not?
>>
>> Thanks!
>>
>
>


Log hdfs blocks sending

2014-09-25 Thread Alexey Romanchuk
Hello again spark users and developers!

I have a standalone spark cluster (1.1.0) with spark sql running on it. The
cluster consists of 4 datanodes, and the replication factor of the files is 3.

I use the thrift server to access spark sql and have 1 table with 30+
partitions. When I run a query over the whole table (something simple like
select count(*) from t), spark produces a lot of network activity, filling the
whole 1gb link. It looks like spark sends the data over the network instead of
reading it locally.

Is there any way to log which blocks were accessed locally and which were not?

Thanks!