Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Yuanjian Li
Hi Joe, I think you hit this issue: https://issues.apache.org/jira/browse/SPARK-27340 You can check the description in the Jira and the PR. We also hit this in our production env and fixed it with the PR provided. The PR is still in review. cc Langchang Zhu(zhuliangch...@baidu.com), who's the author for the

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
Actually I just checked the release; they only changed the PySpark part. So the download on the website will still be Scala 2.12, so you'll need to build the Scala 2.11 version of Spark if you want to use the connector, or submit a PR for Scala 2.12 support. On Mon, May 6, 2019 at 9:21 PM Russell Spitzer

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
Spark 2.4.2 was incorrectly released with the default package binaries set to Scala 2.12 instead of Scala 2.11.12, which was supposed to be the case. See the 2.4.3 vote

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Richard Xin
Thanks for the reply. Unfortunately this is the highest version available for the Cassandra connector. One thing I don't quite understand is that it worked perfectly under Spark 2.4.0. I thought support for Scala 2.11 only became deprecated starting with Spark 2.4.1, and would be removed after Spark 3.0

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
Scala version mismatch: Spark is shown at 2.12, while the connector only has a 2.11 release. On Mon, May 6, 2019, 7:59 PM Richard Xin wrote: > org.apache.spark : spark-core_2.12 : 2.4.0 (compile) > org.apache.spark : spark-sql_2.12 : 2.4.0 >

spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Richard Xin
My Maven dependencies:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.0</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.4.1</version>
  </dependency>

When I run spark-submit I got the following exceptions on Spark 2.4.2; it works fine when running spark-submit under
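For reference, a set of coordinates with matching Scala suffixes (a sketch, assuming you stay on the connector's Scala 2.11 build and the versions quoted in the thread) would look like:

```xml
<!-- all artifacts share the _2.11 Scala suffix; versions assumed from the thread -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.11</artifactId>
  <version>2.4.1</version>
</dependency>
```

The cluster's Spark distribution must also be built for the same Scala version, which is the point made elsewhere in this thread about the 2.4.2 binaries.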

Re: Performance Decrease in spark

2019-05-06 Thread Gourav Sengupta
Hi, can you please share your code? Regards, Gourav On Mon, May 6, 2019 at 8:28 AM yuvraj singh <19yuvrajsing...@gmail.com> wrote: > Hi all, > > We moved from Spark 2.1.2 to 2.3.3 and all our MySQL queries became very slow. > > Please help me with this. > > Thanks > Yubraj Singh

Re: Dynamic metric names

2019-05-06 Thread Sergey Zhemzhitsky
Hi Saisai, Thanks a lot for the link! This is exactly what I need. Just curious why this PR has not been merged, as it seems to implement a rather natural requirement. There are a number of use cases which can benefit from this feature, e.g. - collecting business metrics based on the data's

Image Grep

2019-05-06 Thread swastik mittal
My Spark driver program reads multiple images from HDFS and searches for a particular image by image name. If it finds the image, it converts the received byte array of the image back to its original form. But the image I get after conversion is corrupted. I am using ImageSchema to
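One common cause of "corrupted" output here: ImageSchema rows carry decoded pixel bytes (OpenCV-style, BGR channel order), not the original JPEG/PNG file bytes, so writing the `data` field straight back to an image file produces garbage. A minimal NumPy sketch of reinterpreting the bytes, assuming ImageSchema's height/width/nChannels/data layout:

```python
import numpy as np

def decode_image_row(height, width, n_channels, data):
    """Reinterpret ImageSchema-style raw bytes as an image array.

    The bytes are decoded pixel data (BGR order), so they must be
    reshaped -- not written straight to a .jpg -- to recover the image.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    arr = arr.reshape(height, width, n_channels)
    # Flip BGR -> RGB before handing to PIL/matplotlib for saving/display.
    return arr[:, :, ::-1]

# Hypothetical 2x2 BGR image for illustration.
raw = bytes([255, 0, 0,   0, 255, 0,
             0, 0, 255,  10, 20, 30])
img = decode_image_row(2, 2, 3, raw)
```

The resulting array can then be saved with any image library (e.g. `PIL.Image.fromarray(img)`), which re-encodes it as a valid file.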

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
On 5/6/19 6:23 PM, Pat Ferrel wrote: > Streams have no end until watermarked or closed. Joins need bounded datasets, > et voila. Something tells me you should consider the streaming nature of your > data and whether your joins need to use increments/snippets of infinite > streams or to re-join

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Andrew, Do not misrepresent my statements. I mentioned it depends on the use case; I NEVER (note the word "never") mentioned that Pandas UDFs are ALWAYS (note the word "always") slow. Regards, Gourav Sengupta On Mon, May 6, 2019 at 6:00 PM Andrew Melo wrote: > Hi, > > On Mon, May 6, 2019 at

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hence, what I mentioned initially does sound correct? On Mon, May 6, 2019 at 5:43 PM Andrew Melo wrote: > Hi, > > On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy > wrote: > > > > Thanks Gourav. > > > > Incidentally, since the regular UDF is row-wise, we could optimize that > a bit by taking

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta wrote: > > Hence, what I mentioned initially does sound correct? I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course. > > On Mon, May 6, 2019 at 5:43 PM Andrew Melo wrote: >> >>

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy wrote: > > Thanks Gourav. > > Incidentally, since the regular UDF is row-wise, we could optimize that a bit > by taking the convert() closure and simply making that the UDF. > > Since there's that MGRS object that we have to create too, we

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Thanks Gourav. Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF. Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the UDF via rdd.mapPartitions,
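The per-partition optimization Patrick describes can be sketched in plain Python. `ExpensiveConverter` is a hypothetical stand-in for the MGRS object, and the conversion itself is a placeholder; the point is that the costly constructor runs once per partition rather than once per row:

```python
class ExpensiveConverter:
    """Stand-in for a costly-to-construct object (like the MGRS converter)."""
    instances = 0  # track constructions, for illustration only

    def __init__(self):
        ExpensiveConverter.instances += 1

    def convert(self, lat, lon):
        return (round(lat, 2), round(lon, 2))  # placeholder conversion

def convert_partition(rows):
    # One construction per partition, then reused for every row in it.
    conv = ExpensiveConverter()
    for lat, lon in rows:
        yield conv.convert(lat, lon)

# In Spark this would be df.rdd.mapPartitions(convert_partition);
# here we simulate a single in-memory "partition":
out = list(convert_partition([(40.7128, -74.0060), (51.5074, -0.1278)]))
```

With a plain row-wise UDF the converter would be rebuilt (or at least looked up) per row; `mapPartitions` amortizes that cost across the whole partition.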

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Pat Ferrel
Streams have no end until watermarked or closed. Joins need bounded datasets, et voila. Something tells me you should consider the streaming nature of your data and whether your joins need to use increments/snippets of infinite streams or to re-join the entire contents of the streams accumulated

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
The proof is in the pudding :) On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta wrote: > Hi Patrick, > > super duper, thanks a ton for sharing the code. Can you please confirm > that this runs faster than the regular UDFs? > > Interestingly, I am also running the same transformations using another

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Patrick, super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs? Interestingly, I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array. Regards,

Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
Hi all, I'm pretty new to Spark and implementing my first non-trivial structured streaming job with outer joins. My environment is a Hortonworks HDP 3.1 cluster with Spark 2.3.2, working with Python. I understood that I need to provide watermarks and join conditions for left outer joins to

Re: Dynamic metric names

2019-05-06 Thread Saisai Shao
I remember there was a PR about doing a similar thing (https://github.com/apache/spark/pull/18406). From my understanding, this seems like a quite specific requirement; it may require code changes to support your needs. Thanks Saisai Sergey Zhemzhitsky wrote on Sat, May 4, 2019 at 4:44 PM: > Hello Spark
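The "dynamic metric names" idea behind that PR can be sketched in plain Python (this is not Spark's metrics API; the registry and names here are illustrative): instead of declaring every metric up front, counters are created on first access, with names computed from the data at runtime.

```python
from collections import defaultdict

class MetricRegistry:
    """Toy registry where metric names need not be known in advance."""

    def __init__(self):
        self.counters = defaultdict(int)

    def counter(self, name):
        # Creating-on-first-access is what makes the names "dynamic".
        def inc(n=1, _name=name):
            self.counters[_name] += n
        return inc

registry = MetricRegistry()
for record in [{"type": "click"}, {"type": "view"}, {"type": "click"}]:
    # Metric name derived from the record itself, e.g. a business metric.
    registry.counter("events." + record["type"])()
```

This mirrors the use case Sergey mentions: collecting business metrics whose names depend on the data flowing through the job.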

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Human time is considerably more expensive than computer time, so in that regard, yes :) This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation I'd be happy to compare them. @F.pandas_udf(T.StringType(),
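The decorator line above is cut off in the archive. As a standalone illustration of the pattern (not Patrick's actual MGRS code; the grid math below is invented), a pandas UDF body is just a vectorized function on pandas Series, which is why it can be benchmarked outside Spark:

```python
import pandas as pd

def to_grid(lat: pd.Series, lon: pd.Series) -> pd.Series:
    """Vectorized lat/lon -> "grid" string; placeholder for a real
    coordinate conversion like MGRS."""
    return (lat.round().astype(int).astype(str)
            + ","
            + lon.round().astype(int).astype(str))

# In Spark this would be wrapped as F.pandas_udf(T.StringType())(to_grid)
# and applied to whole column batches; standalone it runs on plain Series:
out = to_grid(pd.Series([40.7, 51.4]), pd.Series([-74.0, -0.9]))
```

The speed-up over a row-wise UDF comes from operating on entire Series at once instead of invoking Python per row.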

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
And you found the Pandas UDF more performant? Can you share your code and prove it? On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy wrote: > I disagree that it's hype. Perhaps not 1:1 with pure Scala > performance-wise, but for python-based data scientists or others with a lot > of python

Re: Deep Learning with Spark, what is your experience?

2019-05-06 Thread Gourav Sengupta
The main concern is around the model and its accuracy, and then fitting all that CI/CD hype around it. On Sun, May 5, 2019 at 10:37 PM Riccardo Ferrari wrote: > Thanks everyone, I really appreciate your contributions here. > > @Jason, thanks for the references, I'll take a look. Quickly

Performance Decrease in spark

2019-05-06 Thread yuvraj singh
Hi all, We moved from Spark 2.1.2 to 2.3.3 and all our MySQL queries became very slow. Please help me with this. Thanks Yubraj Singh