RE: PySpark working with Generators

2017-06-29 Thread Mahesh Sawaiker
Wouldn’t this work if you load the files into HDFS and let the number of partitions be equal to the amount of parallelism you want?
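
A minimal sketch of that idea (the path and partition count are made up): sc.textFile takes a minPartitions hint, and repartition forces an exact count, so the partitioning can be matched to the parallelism you want.

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-example")

    # Ask for at least 64 input partitions when reading from HDFS
    rdd = sc.textFile("hdfs:///data/input", minPartitions=64)

    # Or force an exact partition count after loading
    rdd64 = rdd.repartition(64)
    print(rdd64.getNumPartitions())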

RE: using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-29 Thread Mahesh Sawaiker
You could copy the Spark folder to the home directory of each user and set a different Spark home for each one. Not sure what Derby is used for, but you could try using MySQL instead (if it's for the Hive metastore).

Re: PySpark working with Generators

2017-06-29 Thread ayan guha
Hi, I understand that now. However, your function foo() should take a string and parse it, rather than trying to read from a file. This way, you can separate the file-read part from the processing part:
r = sc.wholeTextFiles(path)
parsed = r.map(lambda x: (x[0], foo(x[1])))
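
A runnable version of that suggestion (the path is a placeholder and foo is a stub): wholeTextFiles yields (path, content) pairs, so foo only has to parse the string it is handed and never touches the filesystem itself.

    from pyspark import SparkContext

    def foo(content):
        # Stub parser: turn raw file content into a list of rows
        return [line.split(",") for line in content.splitlines()]

    sc = SparkContext(appName="read-vs-parse")
    r = sc.wholeTextFiles("hdfs:///data/input")
    parsed = r.map(lambda x: (x[0], foo(x[1])))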

Re: PySpark working with Generators

2017-06-29 Thread Saatvik Shah
Hey Ayan, This isn't a typical text file - it's a proprietary data format for which a native Spark reader is not available. Thanks and Regards, Saatvik Shah
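
One possible approach when no native reader exists (the parser below is a hypothetical stand-in): sc.binaryFiles delivers each file to Python as raw bytes, which a pure-Python parser for the proprietary format can then decode.

    from pyspark import SparkContext

    def parse_proprietary(raw_bytes):
        # Placeholder: decode the proprietary format into a list of records
        return [raw_bytes[:16]]

    sc = SparkContext(appName="binary-files")
    pairs = sc.binaryFiles("hdfs:///data/proprietary/")  # (path, bytes) pairs
    records = pairs.flatMap(lambda kv: parse_proprietary(kv[1]))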

Re: Python Spark for full fledged ETL

2017-06-29 Thread Matt Deaver
While you could do this in Spark, it stinks of over-engineering. An ETL tool would be more appropriate, and if budget is an issue you could look at alternatives like Pentaho or Talend.

Re: Python Spark for full fledged ETL

2017-06-29 Thread upkar . kohli
Hi, One more thing - I am talking about Spark in cluster mode without Hadoop. Regards, Upkar

Re: Python Spark for full fledged ETL

2017-06-29 Thread upkar . kohli
Hi, This is my line of thinking - Spark offers a variety of transformations which would support most of the use cases for replacing an ETL tool such as Informatica. The E and T parts of ETL are perfectly covered. Loading may generally require more functionality, though. Spinning up an Informatica cluster

Project tungsten phase2 - SIMD and columnar in-memory storage

2017-06-29 Thread Nirav Patel
I read about the following future optimizations for Tungsten on the Databricks site. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html "There are also a handful of longer term possibilities for Tungsten. In particular, we plan to investigate compilation to LLVM or

Re: Interesting Stateful Streaming question

2017-06-29 Thread kant kodali
Is mapWithState an answer for this? https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
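
For reference, mapWithState is Scala/Java-only in this generation of Spark; updateStateByKey is the PySpark analogue. A minimal sketch under made-up names: each message id is keyed, and the state accumulates the set of states that message has visited so far.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stateful-example")
    ssc = StreamingContext(sc, batchDuration=5)
    ssc.checkpoint("/tmp/checkpoints")  # required for stateful operations

    def update_visited(new_states, visited):
        # visited: set of states seen so far for this message id
        return (visited or set()) | set(new_states)

    # Input lines look like "m1,s3" -> ("m1", "s3") pairs
    messages = ssc.socketTextStream("localhost", 9999) \
                  .map(lambda line: tuple(line.split(",")))
    visited_per_message = messages.updateStateByKey(update_visited)
    visited_per_message.pprint()

    ssc.start()
    ssc.awaitTermination()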

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks so much, Zhang. This definitely helps.

Re: Spark standalone , client mode. How do I monitor?

2017-06-29 Thread Nirav Patel
You can use Ganglia, Ambari or Nagios to monitor Spark workers/masters. Spark executors are resilient. There are many proprietary software companies as well that just do Hadoop application monitoring.

Re: Does spark support Apache Arrow

2017-06-29 Thread Nirav Patel
Kwon, Isn't that JIRA part of the integration with Arrow? As far as Arrow as an in-memory store goes, it probably conflicts with Spark's own Tungsten memory representation, right? Thanks, Nir

Re: Apache Arrow + Spark examples?

2017-06-29 Thread Nirav Patel
Bump. I have the same question as Petr. SPARK-13534 seems to only solve the (de)serialization issue between RDDs and Python objects. However, couldn't Arrow be a standard for in-memory columnar representation? Maybe an alternative to Spark's current in-memory store (k-v blocks or Tungsten)? Thanks, Nir

Re: PySpark working with Generators

2017-06-29 Thread ayan guha
If your files are in the same location you can use sc.wholeTextFiles. If not, sc.textFile accepts a comma-separated list of file paths.

Re: [Spark ML] LogisticRegressionWithSGD

2017-06-29 Thread Md. Rezaul Karim
+1

[Spark ML] LogisticRegressionWithSGD

2017-06-29 Thread Kevin Quinn
Hello, I'd like to build a system that leverages semi-online updates and I wanted to use stochastic gradient descent. However, after looking at the documentation it looks like that method is deprecated. Is there a reason why it was deprecated? Is there a planned replacement? As far as I know
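
LogisticRegressionWithSGD in spark.mllib was deprecated in favor of the spark.ml pipeline API, whose LogisticRegression trains with L-BFGS/OWL-QN rather than plain SGD. A minimal sketch with made-up data:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 1.1)),
         (0.0, Vectors.dense(2.0, 1.0))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(train)
    print(model.coefficients)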

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Yong Zhang
scala> spark.version
res6: String = 2.1.1

scala> val rdd = sc.parallelize(Seq("""Title1.1Description_1.1 Title1.2Description_1.2 Title1.3Description_1.3"""))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> import com.databricks.spark.xml.XmlReader
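
The same idea from PySpark, as a minimal sketch (the rowTag value, file path and field names are assumptions based on the sample above): the spark-xml package turns each repeated XML element into its own row.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    books = (spark.read.format("com.databricks.spark.xml")
             .option("rowTag", "book")   # one output row per <book> element
             .load("/tmp/books.xml"))
    books.select("title", "description").show()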

Spark, S3A, and 503 SlowDown / rate limit issues

2017-06-29 Thread Everett Anderson
Hi, We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O from/to S3 from our Spark jobs. We set mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted S3 buckets. This has been working fine for us, but perhaps as we've been running more jobs in parallel,
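
If throttling is the cause, the S3A client in Hadoop 2.7.x exposes a couple of knobs worth trying (the values below are illustrative, not recommendations): more retry attempts per request, and a smaller connection pool per executor to ease the request rate.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            # Retry each S3 request more times before giving up
            .set("spark.hadoop.fs.s3a.attempts.maximum", "20")
            # Cap concurrent S3 connections per JVM
            .set("spark.hadoop.fs.s3a.connection.maximum", "15"))
    spark = SparkSession.builder.config(conf=conf).getOrCreate()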

PySpark working with Generators

2017-06-29 Thread saatvikshah1994
Hi, I have this file reading function called /foo/ which reads contents into a list of lists, or into a generator of lists of lists, representing the same file. When reading as a complete chunk (1 record array) I do something like: rdd = file_paths_rdd.map(lambda x:
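
For the generator case, a minimal sketch (foo and the paths are placeholders): flatMap consumes the generator lazily and flattens its yields into individual records, so a whole file never has to sit in memory at once.

    from pyspark import SparkContext

    def foo(path):
        # Placeholder parser: yield one row (a list) at a time
        with open(path) as f:
            for line in f:
                yield line.split(",")

    sc = SparkContext(appName="generator-example")
    file_paths_rdd = sc.parallelize(["/data/file1.csv", "/data/file2.csv"])
    rows_rdd = file_paths_rdd.flatMap(foo)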

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Gourav Sengupta
Hi Sotiris, Can you upload any sample data, and then I will write the code and send it across to you? Do you have any particular use case for InfluxDB? Regards, Gourav Sengupta

Interesting Stateful Streaming question

2017-06-29 Thread kant kodali
Hi All, Here is a problem and I am wondering if Spark Streaming is the right tool for this? I have a stream of messages m1, m2, m3, ... and each of those messages can be in state s1, s2, s3, ..., sn (you can imagine the number of states is about 100), and I want to compute some metrics that visit all

Re: The stability of Spark Stream Kafka 010

2017-06-29 Thread Cody Koeninger
Given the emphasis on Structured Streaming, I don't personally expect a lot more work being put into DStreams-based projects, outside of bugfixes. The "stable" designation is kind of arbitrary at that point. The 010 version wasn't developed until the Spark 2.0 timeframe, but you can always try

Re: IDE for python

2017-06-29 Thread Gourav Sengupta
Hi Wan, Out of tremendous curiosity, why did you switch instead of extending? Scala will provide you fantastic advantages with its functional programming approach and very little translation loss to Spark internals. In fact, I just learned Scala to get a better understanding of, and appreciation for, Spark.

Spark querying parquet data partitioned in S3

2017-06-29 Thread fran
We have data stored in S3 partitioned by several columns. Let's say it follows this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files. We run a Spark job on an EMR cluster (1 master, 3 slaves) and realised the following: A) When we declare the initial dataframe to be the whole
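
With that layout, a minimal sketch of partition pruning (bucket name and values are hypothetical): loading the root of the layout lets Spark discover column1/column2 as partition columns (the basePath option makes this explicit), so filters on them prune the S3 listing instead of scanning everything.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("basePath", "s3a://bucket/data")
          .parquet("s3a://bucket/data"))

    # Only partitions under column1=X should be listed and read
    subset = df.filter(df["column1"] == "X")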

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Sotiris Beis
Hi Gourav, Do you have any suggestions on how the use of DataFrames would solve my problem? Cheers, Sotiris

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Gourav Sengupta
Hi, I still do not understand why people do not use DataFrames. It makes you smile, take a sip of fine coffee, and feel good about life, and it's all courtesy of SPARK. :) Regards, Gourav Sengupta

spark.pyspark.python is ignored?

2017-06-29 Thread Jason White
According to the documentation, `spark.pyspark.python` configures which Python executable is run on the workers. It seems to be ignored in my simple test case. I'm running a pip-installed PySpark 2.1.1, completely stock. The only customization at this point is my Hadoop configuration directory.
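
A quick sanity check for this (not a fix, and the interpreter path is made up): ask each worker which executable it is actually running and compare with the driver's.

    import sys
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.pyspark.python", "/usr/bin/python3")
             .getOrCreate())
    workers = (spark.sparkContext.parallelize(range(2), 2)
               .map(lambda _: sys.executable).distinct().collect())
    print("driver:", sys.executable, "workers:", workers)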

Re: Python Spark for full fledged ETL

2017-06-29 Thread Gourav Sengupta
SPARK + JDBC. But why? Regards, Gourav Sengupta

The stability of Spark Stream Kafka 010

2017-06-29 Thread Martin Peng
Hi, We planned to upgrade our Spark Kafka library to 0.10 from 0.8.1 to simplify our infrastructure code logic. Does anybody know when the 010 version will move from experimental to stable? May I use the 010 version together with Spark 1.5.1?

Spark on yarn logging

2017-06-29 Thread John Vines
I followed the instructions for configuring a custom logger per https://spark.apache.org/docs/2.0.2/running-on-yarn.html (because we have long-running Spark jobs that occasionally get stuck and, without a rolling file appender, will fill up the disk). This seems to work well for us, but it breaks

Python Spark for full fledged ETL

2017-06-29 Thread upkar_kohli
Hi, Has anyone tried mixing Spark with some of the other Python JDBC/ODBC packages to create an end-to-end ETL framework? The framework would enable update, delete and other DML operations, along with stored proc/function calls, across a variety of databases. Any setup that would be easy to use.
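
A minimal sketch of the Spark half of such a framework (URL, tables and credentials are all made up): the built-in JDBC data source covers bulk extract and load, while row-level DML and stored-procedure calls would still need a plain Python driver such as JayDeBeApi or pyodbc alongside it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Extract: read a source table over JDBC
    src = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/sales")
           .option("dbtable", "public.orders")
           .option("user", "etl").option("password", "secret")
           .load())

    # Transform: anything Spark SQL can express
    daily = src.groupBy("order_date").count()

    # Load: append the result into a target table
    (daily.write.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
          .option("dbtable", "public.daily_order_counts")
          .option("user", "etl").option("password", "secret")
          .mode("append").save())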

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
Hi Ayan, I'm afraid I can't help you here. I'm a beginner in Spark with zero experience in Scala (I have always used PySpark). In my case, I had a list of numbers, and the explode function was exactly what I was looking for. Sorry about that; hope you can solve it. Best, Judit

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks Judit, Ayan. Judit, you almost got it. The explode might help here, but when I tried, I see that load() doesn't like to read from the xmlcomment column on oracle_data:
scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT *

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Ryan
I think it creates a new connection on each worker; whenever the Processor references Resource, it gets initialized. There's no need for the driver to connect to the DB in this case.

Understanding how spark share db connections created on driver

2017-06-29 Thread salvador
Hi all, I am writing a Spark job from which at some point I want to send some metrics to InfluxDB. Here is some sample code of how I am doing it at the moment. I have a Resources object class which contains all the details for the DB connection: object Resources { def forceInit: () => Unit = ()
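
The usual pattern here, as a minimal sketch (the InfluxDB client calls, the DataFrame and all names are assumptions): a connection object is not serializable, so rather than sharing the driver's connection, each partition creates its own on the executor where the data lives.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    metrics_df = spark.createDataFrame([(1, 0.5)], ["job_id", "latency"])

    def write_partition(rows):
        # Imported inside the function so it runs on the executor
        from influxdb import InfluxDBClient  # hypothetical client dependency
        client = InfluxDBClient(host="influx-host", port=8086,
                                database="metrics")
        points = [{"measurement": "job_metrics",
                   "fields": dict(row.asDict())} for row in rows]
        client.write_points(points)

    # One connection per partition, created where the data lives
    metrics_df.rdd.foreachPartition(write_partition)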

Re: What's the simplest way to Read Avro records from Kafka to Spark DataSet/DataFrame?

2017-06-29 Thread kant kodali
Forgot to mention: I am getting a stream of Avro records and I want to do Structured Streaming on these Avro records, but first I want to be able to parse them and put them in a Dataset or something like that.

What's the simplest way to Read Avro records from Kafka to Spark DataSet/DataFrame?

2017-06-29 Thread kant kodali
Hi All, What's the simplest way to read Avro records from Kafka and put them into a Spark DataSet/DataFrame without using the Confluent Schema Registry or the Twitter Bijection API? Thanks!
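
One simple route, as a sketch (schema, topic, broker and field names are all assumptions): ship the writer schema yourself and decode Kafka's binary value column with fastavro inside a UDF, with no registry or Bijection involved.

    import io
    import fastavro
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # The writer schema must match what the producer used
    writer_schema = {
        "type": "record", "name": "Event",
        "fields": [{"name": "id", "type": "string"}],
    }

    def decode(value):
        record = fastavro.schemaless_reader(io.BytesIO(value), writer_schema)
        return record["id"]

    decode_udf = udf(decode, StringType())

    spark = SparkSession.builder.getOrCreate()
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())
    parsed = stream.select(decode_udf(stream["value"]).alias("id"))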

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
Hi Amol, Not sure I completely understand your question, but the SQL function "explode" may help you: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode Here you can find a nice example: https://stackoverflow.com/questions/38210507/explode-in-pyspark
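
A self-contained illustration (the data is made up): explode turns each element of an array column into its own row.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("book1", [1, 2, 3])], ["title", "numbers"])
    df.select("title", explode("numbers").alias("number")).show()
    # book1 | 1
    # book1 | 2
    # book1 | 3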

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread ayan guha
Hi, Not sure if I follow your issue. Can you please post the output of books_inexp.show()? On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol wrote: > Hi: > > We are trying to parse XML data to get the below output from the given input > sample. > Can someone suggest a way to pass

sqlstream for real time analytics

2017-06-29 Thread Mich Talebzadeh
Hi, Has anyone had experience of using SQLstream for real-time analytics (the whole Blaze package), by any chance? Thanks, Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
