Re: (External Email) Re: Simplify Griffin-DSL implementation

2019-01-29 Thread Afshin, Bardia
unsubscribe On 1/29/19, 3:11 PM, "Grant" wrote: We could have a SQL syntax checker using the existing parser logic. Once it detects a SQL expression with the DSL type "griffin-dsl", it could take the following steps: 1. attempt to delegate the execution of the rule to "spark-s
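A minimal Python sketch of the delegation idea (Griffin itself is implemented in Scala; the function name and fallback path here are illustrative): Spark's own parser/analyzer can act as the syntax check, since `spark.sql()` raises eagerly on invalid SQL.

```python
# Hedged sketch: use Spark's parser/analyzer as the SQL syntax check for a
# "griffin-dsl" rule, falling back to the DSL translation path on failure.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException, ParseException

spark = SparkSession.builder.getOrCreate()

def try_delegate_to_spark_sql(rule_sql):
    """Return a DataFrame if the rule parses/analyzes as Spark SQL, else None."""
    try:
        return spark.sql(rule_sql)  # parsing and analysis happen eagerly here
    except (ParseException, AnalysisException):
        return None  # fall back to the griffin-dsl translation path
```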

Re: (External Email) [GitHub] asfgit merged pull request #20: Replace images/project.jpg

2019-01-21 Thread Afshin, Bardia
Unsubscribe! On 1/21/19, 6:04 AM, "GitBox" wrote: asfgit merged pull request #20: Replace images/project.jpg URL: https://github.com/apache/griffin-site/pull/20

Re: (External Email) Re: griffin technical discussion

2019-01-04 Thread Afshin, Bardia
Unsubscribe me please On 1/3/19, 6:48 PM, "Zhen Li" wrote: Hi Lionel, Excuse me, but I didn't see the QR code in the mail attachment. I am a software engineer at an e-commerce company, and our big data platform now needs a data quality system. Many thanks. Zhen > On 20

Re: (External Email) [GitHub] griffin issue #466: Remove incubator

2018-12-10 Thread Afshin, Bardia
unsubscribe On 12/6/18, 4:22 AM, "guoyuepeng" wrote: Github user guoyuepeng commented on the issue: https://github.com/apache/griffin/pull/466

spark-submit --py-files with EMR add-step?

2018-03-07 Thread Afshin, Bardia
I’m writing this email to reach out to the community to demystify the --py-files parameter when working with spark-submit and Python projects. Currently I have a project, say Src/ containing Main.py and Modules/module1.py. When I zip up the src directory and submit it to Spark via EMR add-step, the
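For reference, a hedged sketch of submitting such a step with boto3 (cluster ID, bucket, and key names are placeholders, not from the thread); the usual pattern is to zip only the importable packages and pass the zip via `--py-files`, with the entry script given separately.

```python
# Hedged sketch: add a spark-submit step to EMR with --py-files via boto3.
# First zip the package and upload it, e.g.:
#   cd Src && zip -r modules.zip Modules && aws s3 cp modules.zip s3://my-bucket/
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "pyspark-with-pyfiles",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--py-files", "s3://my-bucket/modules.zip",  # makes Modules importable
                "s3://my-bucket/Main.py",                    # the entry script
            ],
        },
    }],
)
print(response["StepIds"])
```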

UDF issues with spark

2017-12-08 Thread Afshin, Bardia
Using the pyspark CLI on Spark 2.1.1, I’m getting out-of-memory issues when running a UDF on a record set with a count of 10, mapping each row to the same value (arbitrary, for testing purposes). This is on Amazon EMR release label 5.6.0 with the following hardware specs: m4.4xlarge, 32 vCPU, 64 GiB m
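A minimal repro sketch of the setup described (the column name and constant are illustrative): ten rows pushed through a Python UDF that returns the same value for every input.

```python
# Hedged sketch of the reported case: a 10-row record set mapped through a
# Python UDF that returns one arbitrary constant.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-oom-repro").getOrCreate()

df = spark.range(10)                          # record set with a count of 10
constant = udf(lambda _: "x", StringType())   # same value for every row

df.withColumn("mapped", constant(df["id"])).show()
```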

spark writes hex null string terminators into columns

2017-05-12 Thread Afshin, Bardia
I’m running a process where I load the original data, remove some columns, and write the remaining columns out to an output file. Spark is putting hex 00 (null bytes) into some of the columns, and this is causing issues when importing into Redshift. What’s the most efficient way to resolve this?
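One hedged approach (the paths and CSV format are assumptions): scrub the NUL bytes from string columns in Spark before writing, since Redshift's COPY rejects embedded hex 00.

```python
# Hedged sketch: strip embedded NUL (hex 00) bytes from all string columns
# before writing the output that Redshift will load.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3a://my-bucket/input/", header=True)  # placeholder path

for field in df.schema.fields:
    if isinstance(field.dataType, StringType):
        # r"\x00" is the Java regex escape for the NUL character
        df = df.withColumn(field.name, regexp_replace(col(field.name), r"\x00", ""))

df.write.csv("s3a://my-bucket/output/", header=True, mode="overwrite")
```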

long running jobs with Spark

2017-05-04 Thread Afshin, Bardia
Starting long-running jobs with Upstart on Linux (spark-submit) is super slow. I can see only a small percentage of the CPU being utilized, and applying nice -n 20 to the process doesn’t seem to do anything. Has anyone dealt with long-running processes/jobs on Spark and has any best practices t

Re: weird error message

2017-04-26 Thread Afshin, Bardia
Kicking off the process from the ~ directory makes the message go away. I guess the metastore_db directory is created relative to the path from which the shell is launched. FIX: kick off from the ~ directory: ./spark-2.1.0-bin-hadoop2.7/bin/pyspark From: "Afshin, Bardia" Date: Wednesday, April 26, 2017 at 9:47 AM
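A hedged alternative to "always launch from ~" (paths are illustrative): pin the Derby and warehouse locations explicitly, so the launch directory no longer matters. `derby.system.home` must be set before the driver JVM starts, hence the command-line flags.

```python
# Hedged sketch, run inside a pyspark shell started as below (paths are
# illustrative):
#
#   ./spark-2.1.0-bin-hadoop2.7/bin/pyspark \
#     --driver-java-options "-Dderby.system.home=/home/ubuntu" \
#     --conf spark.sql.warehouse.dir=/home/ubuntu/spark-warehouse
#
# metastore_db is then created under derby.system.home regardless of cwd.
# The effective warehouse location can be checked from the shell's `spark`:
print(spark.conf.get("spark.sql.warehouse.dir"))
```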

Re: weird error message

2017-04-26 Thread Afshin, Bardia
…org.apache.spark.sql.hive.HiveSessionState':" >>> ubuntu@:~/spark-2.1.0-bin-hadoop2.7$ ps aux | grep spark ubuntu 2796 0.0 0.0 10460 932 pts/0 S+ 16:44 0:00 grep --color=auto spark From: Jacek Laskowski Date: Wednesday, April 26, 2017 at 12:51 AM To: "Afshin, Bardia

weird error message

2017-04-25 Thread Afshin, Bardia
I’m having issues when I fire up pyspark on a fresh install. When I submit the same process via spark-submit, it works. Here’s a dump of the trace: at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Nati

community feedback on Redshift with Spark

2017-04-24 Thread Afshin, Bardia
I wanted to reach out to the community to get an understanding of everyone's experience with maximizing performance, i.e., decreasing load time when loading multiple large datasets into Redshift. Two approaches: 1. Spark writes files to S3; Redshift COPY from the S3 bucket. 2.
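For approach 1, a hedged sketch (cluster endpoint, table, bucket, and IAM role are placeholders): Spark stages the files to S3, then a single Redshift COPY loads the whole prefix in parallel.

```python
# Hedged sketch of approach 1: Spark stages to S3, Redshift COPYs the prefix.
import psycopg2

# Step 1 (inside the Spark job): stage the dataset to S3, e.g.
#   df.write.csv("s3a://my-bucket/staging/my_table/", mode="overwrite")

# Step 2: issue the COPY; Redshift loads all files under the prefix in parallel.
conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                        dbname="mydb", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/staging/my_table/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV;
    """)
```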

removing columns from file

2017-04-24 Thread Afshin, Bardia
Hi there, I have a process that downloads thousands of files from an S3 bucket, removes a set of columns from each, and uploads the result back to S3. S3 is currently not the bottleneck; having a single-master-node Spark instance is the bottleneck. One approach is to distribute the files on multiple Spark Maste
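A hedged sketch of the cluster-side alternative (paths and column names are placeholders): rather than downloading files one by one on a single node, point Spark at the whole S3 prefix so the read, the column drop, and the write are all distributed.

```python
# Hedged sketch: read the whole prefix in parallel, drop columns, write back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("s3a://my-bucket/incoming/*.csv", header=True)
df = df.drop("ssn", "dob")  # the set of columns to remove (illustrative)
df.write.csv("s3a://my-bucket/cleaned/", header=True, mode="overwrite")
```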

Re: How to maintain order of key-value in DataFrame same as JSON?

2017-04-24 Thread Afshin, Bardia
Is there an API available to do this via SparkSession? Sent from my iPhone On Apr 24, 2017, at 6:20 AM, Devender Yadav <devender.ya...@impetus.co.in> wrote: Thanks Hemanth for a quick reply. From: Hemanth Gudela <hemanth.gud...@qvantel.com> Sent:
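One hedged answer sketch (column names are illustrative): DataFrame columns keep the order in which they are selected, so an explicit `select` before `toJSON()` controls the key order; note that `spark.read.json` sorts inferred fields alphabetically, which is why the original order is lost in the first place.

```python
# Hedged sketch: control JSON key order with an explicit select before toJSON.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a", "x")], ["id", "name", "address"])
ordered = df.select("id", "name", "address")  # desired key order
print(ordered.toJSON().first())               # {"id":1,"name":"a","address":"x"}
```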

question regarding pyspark

2017-04-21 Thread Afshin, Bardia
I’m ingesting a CSV with hundreds of columns, and the original CSV file itself doesn’t have any header. I do have a separate file that is just the headers; is there a way to tell the Spark API this information when loading the CSV file, or do I have to do some preprocessing before doing so? Thanks,
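A hedged sketch of one way to do it (paths are placeholders): read the header-only file first, then apply its names to the headerless CSV with `toDF()`.

```python
# Hedged sketch: apply column names from a separate header file to a
# headerless CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The separate file contains a single line of comma-separated column names.
header_line = spark.sparkContext.textFile("s3a://my-bucket/headers.csv").first()
columns = header_line.split(",")

df = spark.read.csv("s3a://my-bucket/data.csv", header=False)
df = df.toDF(*columns)  # replaces the default _c0, _c1, ... names
```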

Hadoop s3 integration for Spark

2017-04-14 Thread Afshin, Bardia
Hello community. I’m considering consuming S3 objects in Hadoop via the s3a protocol. The main purpose of this is to utilize Spark to access S3, and it seems like the only formal protocol/integration for doing so is Hadoop. The process that I am implementing is rather formal and straightforward
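For reference, a hedged sketch of wiring up s3a from PySpark (credentials, endpoint, and bucket are placeholders; the hadoop-aws and matching aws-java-sdk jars must already be on the classpath).

```python
# Hedged sketch: configure and exercise the s3a filesystem from PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "AKIA...")   # or rely on an instance profile
hadoop_conf.set("fs.s3a.secret.key", "...")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.text("s3a://my-bucket/some/prefix/")
df.show(5)
```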