Re: Do we have anything for Deep Learning in Spark?

2017-07-05 Thread hosur narahari
Hi Roope, Does this mmlspark project use GPGPU for processing, or just CPU cores, since DL models are computationally very intensive? Best Regards, Hari On 6 Jul 2017 9:33 a.m., "Gaurav1809" wrote: > Thanks Roope for the inputs. > > On Wed, Jul 5, 2017 at 11:41 PM,

RE: SparkSession via HS2 - Error: Yarn application has already ended

2017-07-05 Thread Sudha KS
While testing like this, it does not read the cluster's hive-site.xml or spark-env.sh (those settings had to be passed in via SparkSession.builder().config()). Is there a way to make it read the Spark config present in the cluster? From: Sudha KS Sent: Wednesday, July 5, 2017 6:45 PM To: user@spark.apache.org Subject: RE:

Re: custom column types for JDBC datasource writer

2017-07-05 Thread Georg Heiler
Great, thanks! But in the current release, is there any way to catch the exception and handle it, i.e. not have Spark only log it to the console? Takeshi Yamamuro wrote on Thu., 6 July 2017 at 06:44: > -dev +user > > You can in master and see >

Re: custom column types for JDBC datasource writer

2017-07-05 Thread Takeshi Yamamuro
-dev +user You can do this in master; see https://github.com/apache/spark/commit/c7911807050227fcd13161ce090330d9d8daa533 . This option will be available in the next release. // maropu On Thu, Jul 6, 2017 at 1:25 PM, Georg Heiler wrote: > Hi, > is it possible to somehow
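
For reference, a sketch of how that option is used once available, based on the linked commit; the URL, table name, and connection properties below are placeholders, not from the thread:

    // Override the DDL column types used when the JDBC writer creates the table.
    import java.util.Properties
    val props = new Properties() // placeholder connection properties
    df.write
      .option("createTableColumnTypes", "name VARCHAR(128), comments CLOB")
      .jdbc("jdbc:postgresql:dbserver", "schema.tablename", props)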

Re: Do we have anything for Deep Learning in Spark?

2017-07-05 Thread Gaurav1809
Thanks Roope for the inputs. On Wed, Jul 5, 2017 at 11:41 PM, Roope [via Apache Spark User List] < ml+s1001560n2882...@n3.nabble.com> wrote: > Microsoft Machine Learning Library for Apache Spark lets you run CNTK deep > learning models on Spark. > > https://github.com/Azure/mmlspark > > The

UDAFs for sketching Dataset columns with T-Digests

2017-07-05 Thread Erik Erlandson
After my talk on T-Digests in Spark at Spark Summit East, there were some requests for a UDAF-based interface for working with Datasets. I'm pleased to announce that I released a library for doing T-Digest sketching with UDAFs: https://github.com/isarn/isarn-sketches-spark This initial release
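
For readers unfamiliar with the extension point involved: such a library builds on Spark 2.x's UserDefinedAggregateFunction. A minimal skeleton (a running mean, not the library's own code) looks like this:

    // Minimal Spark 2.x UDAF skeleton; a sketching UDAF plugs into the same API.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class MeanUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("sum", DoubleType) :: StructField("n", LongType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = 0.0; buffer(1) = 0L
      }
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          buffer(0) = buffer.getDouble(0) + input.getDouble(0)
          buffer(1) = buffer.getLong(1) + 1L
        }
      def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        b1(0) = b1.getDouble(0) + b2.getDouble(0)
        b1(1) = b1.getLong(1) + b2.getLong(1)
      }
      def evaluate(buffer: Row): Double =
        if (buffer.getLong(1) == 0L) Double.NaN
        else buffer.getDouble(0) / buffer.getLong(1)
    }
    // usage: df.agg(new MeanUDAF()(col("value")))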

Exception: JDK-8154035 using Whole text files api

2017-07-05 Thread Reth RM
Hi, I am using sc.wholeTextFiles to read a WARC file (example file here). Spark is reporting an error with the stack trace pasted here: https://pastebin.com/qfmM2eKk It looks like the same bug as the one reported here:
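
For context, the call in question returns one (path, content) pair per file, so each file's entire contents must fit in a single JVM String; the path below is a placeholder:

    // Each WARC file becomes a single (path, content) record; very large files
    // can run into JVM string/array size limits.
    val warcs = sc.wholeTextFiles("hdfs:///data/warc/") // hypothetical location
    warcs.map { case (path, content) => (path, content.length) }.take(5)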

Re: PySpark working with Generators

2017-07-05 Thread Saatvik Shah
Hi Jörn, I apologize for such a late response. Yes, the data volume is very high (it won't fit in one machine's memory) and I am getting a significant benefit from reading the files in a distributed manner. Since the data volume is high, converting it to an alternative format would be a worst case

Re: Spark | Window Function |

2017-07-05 Thread Radhwane Chebaane
Hi Julien, Although this is a strange bug in Spark, it's rare to need a window larger than the Integer max value. Nevertheless, most window functions can be expressed with self-joins, so your problem may be solved as in this example. If the input data is as follows:
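
A minimal sketch of the self-join idea, assuming columns id, timestamp, and value as in the original question; the one-second range is illustrative:

    // Replace a time-range window with a self-join: for each row of `a`,
    // aggregate the matching rows of `b` within the time range.
    import org.apache.spark.sql.functions._
    val joined = df.as("a")
      .join(df.as("b"), col("a.id") === col("b.id") &&
        col("b.timestamp").between(col("a.timestamp") - 1000L, col("a.timestamp")))
      .groupBy(col("a.id"), col("a.timestamp"))
      .agg(max(col("b.value")).as("max_last_second"))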

Re: Load multiple CSV from different paths

2017-07-05 Thread Didac Gil
Thanks man! That was the key: val sources = […].toSeq, then passing sources: _*. Learnt something more about Scala. > On 5 Jul 2017, at 16:29, Radhwane Chebaane wrote: > > Hi, > > Referring to spark 2.x documentation, in org.apache.spark.sql.DataFrameReader > you have this function:

Collecting matrix's entries raises an error only when run inside a test

2017-07-05 Thread Simone Robutti
Hello, I have this problem and Google is not helping. It looks like an unreported bug, and there are no hints at possible workarounds. The error is the following: Traceback (most recent call last): File "/home/simone/motionlogic/trip-labeler/test/trip_labeler_test/model_test.py", line

Re: Load multiple CSV from different paths

2017-07-05 Thread Radhwane Chebaane
Hi, Referring to the Spark 2.x documentation, in org.apache.spark.sql.DataFrameReader you have this function: def csv(paths: String*): DataFrame. So you
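
A sketch of that call with specific paths collected into a Seq and expanded into the varargs parameter (the paths are placeholders):

    // Expanding a Seq of specific files into csv(paths: String*).
    val paths = Seq("/data/2017/07/a.csv", "/other/data/b.csv") // placeholders
    val df = spark.read.option("header", "true").csv(paths: _*)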

Load multiple CSV from different paths

2017-07-05 Thread Didac Gil
Hi, Do you know any simple way to load multiple csv files (same schema) that are in different paths? Wildcards are not a solution, as I want to load specific csv files from different folders. I came across a solution

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-05 Thread Vadim Semenov
Are you sure that you are using S3A? EMR says that they do not support S3A: https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/ > Amazon EMR does not currently support use of the Apache Hadoop S3A file system. I think that the HEAD requests come from the

RE: SparkSession via HS2 - Error: Yarn application has already ended

2017-07-05 Thread Sudha KS
For now, passing the config in SparkSession:
SparkSession spark = SparkSession
    .builder()
    .enableHiveSupport()
    .master("yarn-client")
    .appName("SampleSparkUDTF_yarnV1")
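
A hedged completion of that snippet in Scala form; the .config(...) entry and getOrCreate() are assumptions about how the truncated code ends, and the jars path is the one quoted elsewhere in this digest:

    // Assumed completion, not the poster's verbatim code.
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .master("yarn-client")
      .appName("SampleSparkUDTF_yarnV1")
      .config("spark.yarn.jars",
        "hdfs://ambari03.fuzzyl.com:8020/hdp/apps/2.6.1.0-129/spark2") // from the related thread
      .getOrCreate()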

Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran
> On 29 Jun 2017, at 17:44, fran wrote: > > We have data stored in S3 partitioned by several columns. Let's say > following this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in an EMR cluster (1 master, 3 slaves) and
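
For reference, with a layout like that, Spark's partition discovery infers column1 and column2 from the directory names. A minimal read sketch (the bucket and partition values are the placeholders from the quoted mail):

    // basePath keeps partition discovery rooted at /data even when reading
    // only a subset of partitions.
    val df = spark.read
      .option("basePath", "s3://bucket/data")
      .parquet("s3://bucket/data/column1=X/column2=Y")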

Re: SparkSession via HS2 - Error -spark.yarn.jars not read

2017-07-05 Thread Sandeep Nemuri
STS refers to spark-thrift-sparkconf.conf. Can you check if spark.yarn.jars exists in that file? On Wed, Jul 5, 2017 at 2:01 PM, Sudha KS wrote: > The property “spark.yarn.jars” is available via /usr/hdp/current/spark2- > client/conf/spark-default.conf > > > >

Reading csv.gz files

2017-07-05 Thread Sea aj
I need to import a set of files with the csv.gz extension into Spark. Each file contains a table of data. I was wondering if anyone knows how to read them?
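
For what it's worth, Spark's CSV reader decompresses gzip transparently based on the file extension, so these files can usually be read directly; the path and header option below are placeholders:

    // Gzip is handled automatically; note that .gz files are not splittable,
    // so each file is read as a single partition.
    val df = spark.read
      .option("header", "true")
      .csv("/data/tables/*.csv.gz") // placeholder path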

Spark | Window Function |

2017-07-05 Thread Julien CHAMP
Hi there! Let me explain my problem to see if you have a good solution to help me :) Let's imagine that I have all my data in a DB or a file, which I load into a dataframe DF with the following columns:
id | timestamp(ms) | value
A  | 100           | 100
A  | 110           | 50
B  | 100           | 100
B  |
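
A sketch of a time-range window over data shaped like the above; this is the construct whose frame bounds the thread goes on to discuss, and the one-second range here is illustrative:

    // Sum of `value` over the last 1000 ms per id, using a range frame.
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    val w = Window.partitionBy(col("id")).orderBy(col("timestamp")).rangeBetween(-1000L, 0L)
    val out = df.withColumn("sum_last_second", sum(col("value")).over(w))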

RE: SparkSession via HS2 - Error -spark.yarn.jars not read

2017-07-05 Thread Sudha KS
The property "spark.yarn.jars" available via /usr/hdp/current/spark2-client/conf/spark-default.conf spark.yarn.jars hdfs://ambari03.fuzzyl.com:8020/hdp/apps/2.6.1.0-129/spark2 Is there any other way to set/read/pass this property "spark.yarn.jars" ? From: Sudha KS

SparkSession via HS2 - Error -spark.yarn.jars not read

2017-07-05 Thread Sudha KS
Why does "spark.yarn.jars" property not read, in this HDP 2.6 , Spark2.1.1 cluster: 0: jdbc:hive2://localhost:1/db> set spark.yarn.jars; +--+--+ | set

Re: Kafka 0.10 with PySpark

2017-07-05 Thread Saisai Shao
Please see the reason in this thread (https://github.com/apache/spark/pull/14340). It would be better to use Structured Streaming instead. So I would like to -1 this patch. I think it's been a mistake to support > dstream in Python -- yes it satisfies a checkbox and Spark could claim > there's