/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py
--
Ruslan Dautkhanov
On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin <r...@databricks.com> wrote:
> We probably should have the alias. Is this still a problem on master
> branch?
>
> On Wed, Mar 16, 2016 at 9:40 AM, Ruslan D
r: Could not parse datatype: bigint
Looks like pyspark.sql.types doesn't know anything about bigint.
Should it be aliased to LongType in pyspark.sql.types?
Thanks
On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:
> Hello,
>
> Looking at
>
>
ntegerType() for "integer" etc? If it doesn't exist it would be great to
have such a
mapping function.
Thank you.
ps. I have a data frame, and use its dtypes to loop through all columns to
fix a few
columns' data types as a workaround for SPARK-13866.
--
Ruslan Dautkhanov
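The mapping function asked about above could be sketched like this in plain Python. The alias table and the spark_type_name helper are hypothetical illustrations for this thread, not part of the pyspark.sql.types API:

```python
# Hypothetical sketch: map SQL type-name strings (as returned by df.dtypes)
# to pyspark.sql.types class names. "bigint" aliases to LongType, which is
# the mapping the thread suggests pyspark should know about.
SQL_TYPE_ALIASES = {
    "bigint": "LongType",
    "int": "IntegerType",
    "integer": "IntegerType",
    "smallint": "ShortType",
    "tinyint": "ByteType",
    "double": "DoubleType",
    "float": "FloatType",
    "string": "StringType",
    "boolean": "BooleanType",
    "timestamp": "TimestampType",
    "date": "DateType",
}

def spark_type_name(sql_name):
    """Return the pyspark.sql.types class name for a SQL type name."""
    try:
        return SQL_TYPE_ALIASES[sql_name.strip().lower()]
    except KeyError:
        # Mirrors the "Could not parse datatype" error seen in the thread.
        raise ValueError("Could not parse datatype: %s" % sql_name)
```

With a table like this, looping over df.dtypes to fix up column types (the SPARK-13866 workaround mentioned above) becomes a simple dictionary lookup per column.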
Spark session dies out after ~40 hours when running against a Hadoop Secure
cluster.
spark-submit has --principal and --keytab, so Kerberos ticket renewal works
fine according to the logs.
Is something happening with the HDFS connection?
These messages come up every 1 second:
See complete stack:
is known and well documented.
--
Ruslan Dautkhanov
Turns out it is a Spark issue
https://issues.apache.org/jira/browse/SPARK-13478
--
Ruslan Dautkhanov
On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:
> Hi Romain,
>
> Thank you for your response.
>
> Adding Kerberos support might be
For a Spark job that only does shuffling
(e.g. Spark SQL with joins, group bys, analytical functions, order bys),
but no explicitly persisted RDDs or dataframes (there are no .cache() calls in
the code),
what would be the lowest recommended setting
for spark.storage.memoryFraction?
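For context on what that setting controls in legacy (pre-1.6) Spark memory management: the storage pool is roughly executor memory times spark.storage.memoryFraction times spark.storage.safetyFraction (defaults 0.6 and 0.9 in Spark 1.x). A small arithmetic sketch, assuming those legacy defaults:

```python
def storage_pool_mb(executor_memory_mb,
                    memory_fraction=0.6,   # spark.storage.memoryFraction default (Spark 1.x)
                    safety_fraction=0.9):  # spark.storage.safetyFraction default (Spark 1.x)
    """Approximate RDD storage pool size for a legacy (pre-1.6) executor."""
    return executor_memory_mb * memory_fraction * safety_fraction

# With no .cache() calls, lowering memoryFraction frees heap for execution:
# an 8 GB executor at the default 0.6 reserves roughly 4.4 GB for storage,
# while 0.1 reserves only about 0.7 GB.
```

This is only a sizing sketch; the lowest safe value still depends on broadcast variables and internal caching, which also live in that pool.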
Yep, I tried that. It seems you're right. I got an error that the execution
engine has to be set to mr:
hive.execution.engine = mr
I did not keep the exact error message/stack. It's probably disabled explicitly.
--
Ruslan Dautkhanov
On Thu, Jan 28, 2016 at 7:03 AM, Todd <bit1...@163.com> wrote:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
There are quite a lot of knobs to tune for Hive on Spark.
The above page recommends the following settings:
mapreduce.input.fileinputformat.split.maxsize=75000
> hive.vectorized.execution.enabled=true
>
I took the liberty and created a JIRA: https://github.com/cloudera/livy/issues/36
Feel free to close it if it doesn't belong to the Livy project.
I really don't know if this is a Spark or a Livy/Sentry problem.
Any ideas for possible workarounds?
Thank you.
--
Ruslan Dautkhanov
On Mon, Jan 18, 2016
() hence
the error.
So Sentry isn't compatible with Spark in kerberized clusters? Is there any
workaround for this problem?
--
Ruslan Dautkhanov
On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux <rom...@cloudera.com> wrote:
> Livy does not support any Kerberos yet
> https://issues.cloudera
d to impersonate to other users.
So very convenient for Spark Notebooks.
Any information to help solve this will be highly appreciated.
--
Ruslan Dautkhanov
Livy build test from master fails with the below problem. Can't track it down.
YARN shows the Livy Spark yarn application as running,
although an attempt to connect to the application master shows connection refused:
HTTP ERROR 500
> Problem accessing /proxy/application_1448640910222_0046/. Reason:
>
Spark
> 1.3.1 does not provide integration with Phoenix for kerberized cluster.
>
> Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured
> cluster or not?
>
> Thanks,
> Akhilesh
>
> On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov <dautkha...@gmail.com>
-table-in-hive/34059289#34059289
--
Ruslan Dautkhanov
On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:
> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
> |-- memberUuid: st
)
kerberos ticket for authentication to pass.
--
Ruslan Dautkhanov
On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:
> Hi,
>
> I am running spark job on yarn in cluster mode in secured cluster. I am
> trying to run Spark on Hbase using Phoenix, b
An interesting compaction approach for small files was discussed recently:
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
AFAIK Spark supports views too.
--
Ruslan Dautkhanov
On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyig
java#welcome-to-livy-the-rest-spark-server>
"
Although that post is from April 2015, not sure if it's still accurate.
--
Ruslan Dautkhanov
On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar <deenar.toras...@gmail.com
> wrote:
> Hi
>
> I had the same question. Anyone havi
more even distribution you could use a hash function
from that, not just a remainder.
--
Ruslan Dautkhanov
On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin <mcgloin.patr...@gmail.com>
wrote:
> I will answer my own question, since I figured it out. Here is my answer
> in case any
You could write your own UDF isdate().
--
Ruslan Dautkhanov
On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani <rrav...@gmail.com> wrote:
> Hi Ted Yu,
>
> Thanks for your response. Is there any other way to achieve this in a Spark query?
>
>
> Regards,
> Ravi
>
> On Tue,
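A minimal sketch of what such an isdate() UDF's logic could look like, as plain Python. The accepted format list is an assumption for illustration; in PySpark the function would still need to be registered as a UDF:

```python
from datetime import datetime

# Hypothetical isdate(): True if the string parses as a date in one of the
# accepted formats. The format list below is an assumption, not a standard.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def isdate(value):
    if value is None:
        return False
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            pass
    return False
```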
thought its primary use
is for Hue and similar services, which use impersonation quite heavily in
kerberized clusters.
--
Ruslan Dautkhanov
On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0]
> hdfs.KeyP
https://issues.apache.org/jira/browse/SPARK-8992
Should be in 1.6?
--
Ruslan Dautkhanov
On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
>
/apache/spark/sql/SQLContext.html
and can't find anything relevant.
Thanks!
--
Ruslan Dautkhanov
Thank you Richard and Matthew.
DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned
earlier, we're on CDH 5.4 / Spark 1.3. No options for this version?
Best regards,
Ruslan Dautkhanov
On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas <rhil...@us.ibm.com> wrote:
).
--
Ruslan Dautkhanov
On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:
> Wanted to take something like this
>
> https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
> and create a Hive UDAF to create an aggregate fun
Similar setup for Hue
http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/
Might give you an idea.
--
Ruslan Dautkhanov
On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 <renato.per...@gmail.com> wrote:
> Hello!
> I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI
Wanted to take something like this
https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
and create a Hive UDAF to create an aggregate function that returns a data
type guess.
Am I reinventing the wheel?
Does Spark have something like this already built-in?
It would be very useful.
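A rough sketch of the per-value guessing step such a UDAF would need, in plain Python. The Hive/Spark aggregation wiring is omitted and the names and type ladder are hypothetical:

```python
from datetime import datetime

def guess_type(value):
    """Guess the narrowest SQL-ish type name for a single string value."""
    if value is None or value == "":
        return "null"
    try:
        int(value)
        return "bigint"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    return "string"

# An aggregate would then combine per-value guesses, widening as needed,
# e.g. bigint + double -> double, anything + string -> string.
```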
Thank you Alexander.
Sounds like quite a lot of good and exciting changes slated for Spark's ANN.
Looking forward to it.
--
Ruslan Dautkhanov
On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander <alexander.ula...@hpe.com>
wrote:
> Thank you, Feynman, this is helpful. The paper that
Sathish,
Thanks for pointing to that.
https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm
That must be only part of Oracle's BDA codebase, not open-source Hive,
right?
--
Ruslan Dautkhanov
On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.
s point is more relevant for OLTP-like queries, which Spark is probably
not yet good at (e.g. returning a few rows quickly / within a few ms).
--
Ruslan Dautkhanov
On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> Either that or use the DataFrame API, wh
You can also sqoop oracle data in
$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl
--username MOVIEDEMO --password welcome1 --table ACTIVITY
http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/
--
Ruslan Dautkhanov
On Tue
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
The implementation seems to be missing backpropagation?
Was there a good reason to omit BP?
What are the drawbacks of a pure feedforward-only ANN?
Thanks!
--
Ruslan Dautkhanov
/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
should read B :)
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang <fli...@databricks.com>
wrote:
> Backprop is used to compute the gradient here
> <https://github.com/apache/sp
Found a dropout commit from avulanov:
https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d
It probably hasn't made its way to MLLib (yet?).
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang <fli...@databricks.com> wrote:
> Unfortunately
Read the response from Cheng Lian <lian.cs@gmail.com> on Aug 27th; it
looks like the same problem.
Workarounds:
1. write that parquet file in Spark;
2. upgrade to Spark 1.5.
--
Ruslan Dautkhanov
On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov <ale...@gmail.com> wrote:
> No, it was
That parquet table wasn't created in Spark, was it?
There was a recent discussion on this list that complex data types in Spark
prior to 1.5 are often incompatible with Hive, for example, if I remember
correctly.
On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote:
> I am trying to read
est/topics/sg_hdfs_sentry_sync.html
--
Ruslan Dautkhanov
On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz <danielschulz2...@hotmail.com>
wrote:
> Hi Matei,
>
> Thanks for your answer.
>
> My question is regarding simple authenticated Spark-on-YARN only, without
> Ker
https://issues.apache.org/jira/browse/SPARK-7660 ?
--
Ruslan Dautkhanov
On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio tarop...@gmail.com wrote:
Right after upgrading to 1.4.1, we started seeing this exception, and yes we
picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6
There is no Spark master in YARN mode. It's standalone mode terminology.
In YARN cluster mode, Spark's Application Master (Spark Driver runs in it)
will be restarted
automatically by RM up to yarn.resourcemanager.am.max-retries
times (default is 2).
--
Ruslan Dautkhanov
On Fri, Jul 17, 2015 at 1
for Spark?
--
Ruslan Dautkhanov
On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu h...@linkedin.com.invalid
wrote:
We are in the middle of figuring that out. At the high level, we want to
combine the best parts of existing workflow solutions.
On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone vikramk
.
--
Ruslan Dautkhanov
On Mon, Aug 10, 2015 at 3:53 PM, YaoPau jonrgr...@gmail.com wrote:
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
iPython Notebook. I'm getting collect() to work just fine, but take()
errors. (I'm having issues with collect() on other datasets
You should also take into account the amount of memory that you plan to use.
It's advised not to give too much memory to each executor, otherwise GC
overhead will go up.
Btw, why prime numbers?
--
Ruslan Dautkhanov
On Wed, Jul 29, 2015 at 3:31 AM, ponkin alexey.pon...@ya.ru wrote:
Hi Rahul
bandwidth-bound, I can see it'll be a few
percent to no improvement.
--
Ruslan Dautkhanov
On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus edel...@gmail.com wrote:
H
2% huh.
-- ttfn
Simon Edelhaus
California 2015
On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra m
or in-memory
columnar storage caching
from traditional RDBMS systems, and may get better and/or more predictable
performance on
BI queries.
--
Ruslan Dautkhanov
On Mon, Jul 20, 2015 at 6:04 PM, renga.kannan renga.kan...@gmail.com
wrote:
All,
I really appreciate anyone's input on this. We are having
Or Spark on HBase )
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
--
Ruslan Dautkhanov
On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. that is, key-value stores
Please consider HBase for this purpose :-)
On Tue, Jul 14, 2015 at 5:55 PM
Should be part of Spark 1.4
https://issues.apache.org/jira/browse/SPARK-1442
I don't see it in the documentation though
https://spark.apache.org/docs/latest/sql-programming-guide.html
--
Ruslan Dautkhanov
On Mon, Jul 6, 2015 at 5:06 AM, gireeshp gireesh.puthum...@augmentiq.in
wrote
the executor receives a SIGTERM (from whom???)
From the YARN Resource Manager.
Check if YARN fair scheduler preemption and/or speculative execution are
turned on; if so, it's quite possible and not a bug.
--
Ruslan Dautkhanov
On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim jongw...@nyu.edu wrote
You can see what Spark SQL functions are supported in Spark by doing the
following in a notebook:
%sql show functions
https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html
I think Spark SQL support is currently around Hive ~0.11?
--
Ruslan
Hi Akhil,
It's interesting: are RDDs stored internally in a columnar format as well?
Or is it only when an RDD is cached in a SQL context that it is converted to
a columnar format?
What about data frames?
Thanks!
--
Ruslan Dautkhanov
On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das ak
Scala used to run on .NET
http://www.scala-lang.org/old/node/10299
--
Ruslan Dautkhanov
On Thu, Jul 2, 2015 at 1:26 PM, pedro ski.rodrig...@gmail.com wrote:
You might try using .pipe() and installing your .NET program as a binary
across the cluster (or using addFile). Its not ideal to pipe
://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html
--
Ruslan Dautkhanov
On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin awal...@inflection.com
wrote:
Hi,
I'd like to specify the total sum of cores
You can run
hadoop checknative -a
and see if bzip2 is detected correctly.
--
Ruslan Dautkhanov
On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin van...@cloudera.com
wrote:
What master are you using? If this is not a local master, you'll need to
set LD_LIBRARY_PATH on the executors also
https://spark.apache.org/docs/latest/streaming-flume-integration.html
Yep, avro sink is the correct one.
--
Ruslan Dautkhanov
On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi!
I want to integrate flume with spark streaming. I want to know which sink
Vadim,
You could edit /etc/fstab, then issue mount -o remount to give more shared
memory online.
Didn't know Spark uses shared memory.
Hope this helps.
On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy vadim.bichuts...@gmail.com
wrote:
Hello Spark Experts,
I've been running a standalone Spark
Got it. Thanks!
--
Ruslan Dautkhanov
On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng men...@gmail.com wrote:
ChiSqSelector takes an RDD of labeled points, where the label is the
target. See
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature
Thank you Xiangrui.
Oracle's attribute importance mining function has a target variable.
Attribute importance is a supervised function that ranks attributes
according to their significance in predicting a target.
MLlib's ChiSqSelector does not have a target variable.
--
Ruslan Dautkhanov
in predicting a target.
Best regards,
Ruslan Dautkhanov
?
Best regards,
Ruslan Dautkhanov
amount of memory node has.
--
Ruslan Dautkhanov
On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen kandy...@gmail.com wrote:
Hi all,
I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and
Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory.
One is configured as headnode
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf
--
Ruslan Dautkhanov
On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thank you, Tathagata
*spark.shuffle.safetyFraction)/
spark.executor.cores. Memory fraction and safety fraction default to 0.2
and 0.8 respectively.
I'd test spark.executor.cores with 2,4,8 and 16 and see what makes your job
run faster..
--
Ruslan Dautkhanov
On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo mulugeta.abe
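Spelling out the (truncated) formula above as arithmetic, assuming it is the legacy shuffle pool, i.e. (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores, with the stated defaults of 0.2 and 0.8:

```python
def shuffle_mb_per_task(executor_memory_mb, cores,
                        memory_fraction=0.2,   # spark.shuffle.memoryFraction default
                        safety_fraction=0.8):  # spark.shuffle.safetyFraction default
    """Approximate shuffle memory available per concurrently running task."""
    return executor_memory_mb * memory_fraction * safety_fraction / cores

# Doubling cores halves per-task shuffle memory, which is why it is worth
# benchmarking 2, 4, 8 and 16 cores rather than assuming more is better.
```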
application logs.
--
Ruslan Dautkhanov
On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets oruchov...@gmail.com
wrote:
Doesn't work for me so far;
using the command I got the following output. What should I check to fix the
issue? Any configuration parameters ...
[root@sdo-hdp-bd-master1 ~]# yarn
Oleg,
You can see applicationId in your Spark History Server.
Go to http://historyserver:18088/
Also check
https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application
It should be no different with PySpark.
--
Ruslan Dautkhanov
On Wed, May 20, 2015 at 2:12 PM, Oleg
You could use
yarn logs -applicationId application_1383601692319_0008
--
Ruslan Dautkhanov
On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets oruchov...@gmail.com
wrote:
Hi ,
I am executing PySpark job on yarn ( hortonworks distribution).
Could someone pointing me where is the log
Had the same question on Stack Overflow recently:
http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark
Lomig Mégard had a detailed answer on how to do this without using LATERAL
VIEW.
On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh ashish23...@gmail.com
wrote: