Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-13 Thread Noman Khan
+1(non-binding) Regards Noman From: Xiao Li Sent: Tuesday, September 12, 2017 2:44:26 AM To: Matei Zaharia; Hyukjin Kwon Cc: spark-dev Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python +1 Xiao On Mon, 11 Sep 2017 at 6:44 PM

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-13 Thread Michael Armbrust
I think the right way to look at this is that the batchId is just a proxy for offsets that is agnostic to what type of source you are reading from (or how many sources there are). We might call into a custom sink with the same batchId more than once, but it will always contain the same data (there is
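
A minimal sketch of the point made in the message above, written in plain Python rather than against Spark's sink API. The class `IdempotentSink`, its `add_batch` method, and the in-memory `committed` store are hypothetical names used only for illustration. The sketch assumes, as the message states, that a replayed batchId always carries the same data, so a batch that has already been committed can be skipped safely.

```python
class IdempotentSink:
    """Hypothetical sink that commits each batch id at most once."""

    def __init__(self):
        # Maps batch id -> rows written for that batch.
        # A real sink would keep this in durable storage (assumption).
        self.committed = {}

    def add_batch(self, batch_id, rows):
        """Write rows for batch_id; return False if it was already committed."""
        if batch_id in self.committed:
            # Same batch id delivered again (for example after a restart).
            # The data is assumed identical, so skipping loses nothing.
            return False
        self.committed[batch_id] = list(rows)
        return True


sink = IdempotentSink()
sink.add_batch(0, ["a", "b"])
sink.add_batch(0, ["a", "b"])  # replay of batch 0, ignored
sink.add_batch(1, ["c"])
```

Keying the write on the batch id rather than on source-specific offsets is what keeps the sink independent of the type and number of sources.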

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Shivaram Venkataraman
Mark, I agree with your point on the risks of using Cloudfront while building Spark. I was only trying to provide background on when we started using Cloudfront. Personally, I don't have enough context about the test case in question (e.g. why are we downloading Spark in a test case?).

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Mark Hamstra
Yeah, but that discussion and use case is a bit different -- providing a different route to download the final released and approved artifacts that were built using only acceptable artifacts and sources vs. building and checking prior to release using something that is not from an Apache mirror.

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Sean Owen
Ah right yeah I know it's an S3 bucket. Thanks for the context. Although I imagine the reasons it was set up no longer apply so much (you can get a direct mirror download link), and so it would probably be possible to retire this, there's also no big rush to. I wasn't clear from the thread whether

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Shivaram Venkataraman
The bucket comes from Cloudfront, a CDN that's part of AWS. There was a bunch of discussion about this back in 2013 https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de536dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E Shivaram On Wed, Sep 13, 2017 at 9:30 AM, Sean

What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Sean Owen
Not a big deal, but Mark noticed that this test now downloads Spark artifacts from the same 'direct download' link available on the downloads page: https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L53

New to dev community | Contribution to Mlib

2017-09-13 Thread Venali Sonone
Hello, I am new to the Spark dev community and to open source in general, but I have used Spark extensively. I want to create a complete part on anomaly detection in Spark MLlib. To that end, I want to know if someone could guide me so I can start the development and contribute to Spark MLlib. Sorry

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-13 Thread Dmitry Naumenko
Thanks, I see. However, I guess reading from the checkpoint directory might be less efficient compared to just preserving offsets in the Dataset. I have one more question about operation idempotence (I hope it helps others get a clear picture). If I read offsets on re-start from RDBMS and manually