RE: Jars directory in Spark 2.0

2017-01-31 Thread Sidney Feiner
Is this done on purpose? Because it really makes it hard to deploy applications. Is there a reason they didn't shade the jars they use to begin with? Sidney Feiner / SW Developer From: Koert Kuipers

Hive Java UDF running on spark-sql issue

2017-01-31 Thread Alex
Hi, we have Java Hive UDFs which are working perfectly fine in Hive. For better performance we are migrating the same to Spark SQL, so we are passing these jar files to spark-sql with the --jars argument and defining temporary functions to make them run on spark-sql. There is this particular Java UDF
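A minimal sketch of the same flow driven from Scala rather than the spark-sql shell, assuming Hive support is enabled and the UDF jar was passed with --jars; the class and table names are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-udf-on-spark-sql")
      .enableHiveSupport()          // needed so Hive UDF classes can be registered
      .getOrCreate()

    // Register the UDF class shipped via --jars under a temporary function name
    spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUDF'")

    // Then call it exactly as in Hive
    spark.sql("SELECT my_udf(record) FROM table1").show()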

Re: Do both pieces of code below do the same thing? I had to refactor code to fit in spark-sql

2017-01-31 Thread Alex
Guys! Please reply. On Tue, Jan 31, 2017 at 12:31 PM, Alex wrote: > public Object get(Object name) { > int pos = getPos((String) name); > if (pos < 0) > return null; > String f = "string"; >

Parameterized types and Datasets - Spark 2.1.0

2017-01-31 Thread Don Drake
I have a set of CSV that I need to perform ETL on, with the plan to re-use a lot of code between each file in a parent abstract class. I tried creating the following simple abstract class that will have a parameterized type of a case class that represents the schema being read in. This won't
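One way to make such an abstract parent compile is to require an Encoder for the type parameter, so .as[T] can resolve it; a sketch under the assumption that each file's schema is modelled as a case class (all names below are examples):

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    // Reusable CSV loader parameterized on the case class that models one file's schema.
    abstract class CsvEtl[T <: Product : Encoder](spark: SparkSession) {
      def load(path: String): Dataset[T] =
        spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(path)
          .as[T]

      // File-specific cleanup lives in the concrete subclasses.
      def transform(ds: Dataset[T]): Dataset[T]
    }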

Re: Converting timezones in Spark

2017-01-31 Thread Don Drake
So, to follow up on this. A few lessons learned: when you print a timestamp, it will only show the date/time in your current timezone, regardless of any conversions you applied to it. The trick is to convert (cast) it to a Long, and then the Java 8 java.time.* functions can translate to any
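A sketch of that idea, assuming a SparkSession named spark and a DataFrame df with a timestamp column named event_ts; the zone ids are just examples:

    import java.time.{Instant, ZoneId}
    import spark.implicits._

    // Epoch seconds carry no timezone, so cast the timestamp to Long first...
    val epochSeconds = df.select($"event_ts".cast("long")).as[Long].head()

    // ...and let java.time render the same instant in whatever zone is needed.
    val inUtc   = Instant.ofEpochSecond(epochSeconds).atZone(ZoneId.of("UTC"))
    val inTokyo = Instant.ofEpochSecond(epochSeconds).atZone(ZoneId.of("Asia/Tokyo"))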

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Just to be clear, the pool object creation happens in the driver code, not in any anonymous function that would be executed on the executor. On Tue, Jan 31, 2017 at 10:21 PM Nipun Arora wrote: > Thanks for the suggestion Ryan, I will convert it to a singleton and

Re: JavaRDD text metadata (file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi, would it maybe be possible to switch it to the text datasource with the input_file_name function? Thanks. On 1 Feb 2017 3:58 a.m., "Manohar753" wrote: Hi All, my Spark job is reading data from a folder having different files with the same structured data. The read JavaRdd
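A sketch of that suggestion, with an example path; input_file_name tags each row with the file it was read from:

    import org.apache.spark.sql.functions.input_file_name

    val lines = spark.read
      .text("/data/input/folder")                    // one row per line, in a column named "value"
      .withColumn("source_file", input_file_name())  // full path of the originating file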

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
The KafkaProducerPool instance is created in the driver, right? What I was saying is that when a Spark job runs, it will serialize KafkaProducerPool and create a new instance on the executor side. You can use the singleton pattern to make sure one JVM process has only one KafkaProducerPool instance.
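A sketch of that singleton idea, with example Kafka settings: the producer is created lazily, at most once per executor JVM, instead of being serialized over from the driver.

    import java.util.Properties
    import org.apache.kafka.clients.producer.KafkaProducer

    object ProducerHolder {
      // lazy val in an object => created at most once per JVM, on first use inside a task
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")  // example broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }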

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-01-31 Thread Koert Kuipers
set is currently not supported. you can use kryo encoder. there is no other work around that i know of. import org.apache.spark.sql.{ Encoder, Encoders } implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]] On Tue, Jan 31, 2017 at 7:33 PM, Jerry Lam wrote:
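A usage sketch with that implicit in scope, assuming a SparkSession named spark:

    // The Set ends up stored as a single kryo-serialized binary column.
    val ds = spark.createDataset(Seq(Set((1L, 2L)), Set((3L, 4L))))
    ds.count()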

Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-01-31 Thread Jerry Lam
Hi guys, I got an exception like the following, when I tried to implement a user defined aggregation function. Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Set[(scala.Long, scala.Long)] The Set[(Long, Long)] is a field in the case class which is the

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
i thought RDD.checkpoint is async? checkpointData is indeed updated synchronously, but checkpointData.isCheckpointed is false until the actual checkpoint operation has completed. and until the actual checkpoint operation is done any operation will be on the original rdd. for example notice how

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
It's a producer pool; the borrow call takes an existing Kafka producer object if one is free, or creates one if all are being used. Shouldn't we re-use Kafka producer objects for writing to Kafka? @Ryan - can you suggest a good solution for writing a DStream to Kafka which can be used in

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
Looks like you create KafkaProducerPool in the driver. So when the task is running in the executor, it will always see a new empty KafkaProducerPool and create KafkaProducers. But nobody closes these KafkaProducers. On Tue, Jan 31, 2017 at 3:02 PM, Nipun Arora wrote:

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Sorry for not writing the patch number, it's spark 1.6.1. The relevant code is here inline. Please have a look and let me know if there is a resource leak. Please also let me know if you need any more details. Thanks Nipun The JavaRDDKafkaWriter code is here inline: import

Re: Kafka dependencies in Eclipse project /Pls assist

2017-01-31 Thread Cody Koeninger
spark-streaming-kafka-0-10 has a transitive dependency on the kafka library, you shouldn't need to include kafka explicitly. What's your actual list of dependencies? On Tue, Jan 31, 2017 at 3:49 PM, Marco Mistroni wrote: > HI all > i am trying to run a sample spark code
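For reference, a sketch of a minimal sbt dependency list for this setup (Spark 2.0.1, Scala 2.11, as described in the thread); the exact versions may need adjusting:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
      // no explicit kafka / kafka-clients entry: it is pulled in transitively
    )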

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
Please also include the patch version, such as 1.6.0, 1.6.1. Could you also post the JavaRDDKafkaWriter code? It's also possible that it leaks resources. On Tue, Jan 31, 2017 at 2:12 PM, Nipun Arora wrote: > It is spark 1.6 > > Thanks > Nipun > > On Tue, Jan 31, 2017

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
It is spark 1.6 Thanks Nipun On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu wrote: > Could you provide your Spark version please? > > On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora > wrote: > > Hi, > > I get a resource leak, where the

Re: HBase Spark

2017-01-31 Thread Benjamin Kim
Elek, If I cannot use the HBase Spark module, then I’ll give it a try. Thanks, Ben > On Jan 31, 2017, at 1:02 PM, Marton, Elek wrote: > > > I tested this one with hbase 1.2.4: > > https://github.com/hortonworks-spark/shc > > Marton > > On 01/31/2017 09:17 PM, Benjamin Kim

Kafka dependencies in Eclipse project /Pls assist

2017-01-31 Thread Marco Mistroni
Hi all, I am trying to run a sample Spark code which reads streaming data from Kafka. I have followed the instructions here: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html Here's my setup - Spark: 2.0.1, Kafka: 0.10.1.1, Scala version: 2.11. Libraries used -

Re: Question about "Output Op Duration" in SparkStreaming Batch details UX

2017-01-31 Thread Shixiong(Ryan) Zhu
It means the total time to run a batch, including the Spark job duration + time spent on the driver. E.g., foreachRDD { rdd => rdd.count() // say this takes 1 second. Thread.sleep(10000) // sleep 10 seconds } In the above example, the Spark job duration is 1 second and the output op

increasing cross join speed

2017-01-31 Thread Kürşat Kurt
Hi; I have 2 dataframes. I am trying to cross join them to find vector distances, so that I can then choose the most similar vectors. The cross join speed is too slow. How can I increase the speed, or do you have any suggestion for this comparison? val result = myDict.join(mainDataset).map(x => {
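One thing one might try (a sketch, not something from the thread): if the dictionary side is small enough, broadcast it so the cross product avoids a shuffle; crossJoin is the explicit API for this in Spark 2.1+. The DataFrame names follow the question.

    import org.apache.spark.sql.functions.broadcast

    // mainDataset and myDict are the two DataFrames from the question
    val pairs = mainDataset.crossJoin(broadcast(myDict))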

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
Hi Koert, When eager is true, we return you a new DataFrame that depends on the files written out to the checkpoint directory. All previous operations on the checkpointed DataFrame are gone forever. You basically start fresh. AFAIK, when eager is true, the method will not return until the
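A sketch of the two modes described above, assuming a DataFrame df and an example checkpoint directory:

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   // example path

    val eagerCp = df.checkpoint(eager = true)   // files written now; returned plan starts fresh
    val lazyCp  = df.checkpoint(eager = false)  // materialized only when an action runs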

Re: HBase Spark

2017-01-31 Thread Marton, Elek
I tested this one with hbase 1.2.4: https://github.com/hortonworks-spark/shc Marton On 01/31/2017 09:17 PM, Benjamin Kim wrote: Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I tried to build it from source, but I cannot get it to work. Thanks, Ben

Question about "Output Op Duration" in SparkStreaming Batch details UX

2017-01-31 Thread satishl
For Spark Streaming Apps, what does "Output Op Duration" in the batch details UX signify? We have been observing that - for the given batch's last output Op id - Output Op duration > Job duration by a factor. Sometimes it is huge (1 min). I have provided the screenshot below where - you can see

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
i understand that checkpoint cuts the lineage, but i am not fully sure i understand the role of eager. eager simply seems to materialize the rdd early with a count, right after the rdd has been checkpointed. but why is that useful? rdd.checkpoint is asynchronous, so when the rdd.count happens

Re: Spark 2.1.0 and Shapeless

2017-01-31 Thread Koert Kuipers
shading at the fat jar level can work, however it means that in your unit tests where spark is a provided dependency you still can get errors because spark is using an incompatible (newer) shapeless version. the unit tests run with a single resolved shapeless after all. for example spark ships

HBase Spark

2017-01-31 Thread Benjamin Kim
Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I tried to build it from source, but I cannot get it to work. Thanks, Ben

Re: Spark 2.1.0 and Shapeless

2017-01-31 Thread Phil Wills
Are you not able to shade it when you're building your fat jar with something like https://github.com/sbt/sbt-assembly#shading? I would have thought doing the shading at the app level would be a bit less painful than doing it at the library level. Phil On Tue, 31 Jan 2017, 04:24 Timothy Chan,
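A sketch of what that app-level shading could look like in build.sbt, taking shapeless as the example library; the renamed prefix is arbitrary:

    // in build.sbt, with the sbt-assembly plugin added in project/plugins.sbt
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("shapeless.**" -> "my_shaded.shapeless.@1").inAll
    )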

Re: ML version of Kmeans

2017-01-31 Thread Hollin Wilkins
Hey, You could also take a look at MLeap, which provides a runtime for any Spark transformer and does not have any dependencies on a SparkContext or Spark libraries (excepting MLlib-local for linear algebra). https://github.com/combust/mleap On Tue, Jan 31, 2017 at 2:33 AM, Aseem Bansal

JavaRDD text metadata (file name) findings

2017-01-31 Thread Manohar753
Hi All, my Spark job is reading data from a folder having different files with the same structured data. The read JavaRdd is processed line by line, but is there any way to know which file each line of data came from? Thank you in advance for your reply. Thanks,

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
Could you provide your Spark version please? On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora wrote: > Hi, > > I get a resource leak, where the number of file descriptors in spark > streaming keeps increasing. We end up with a "too many file open" error > eventually

Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Hi, I get a resource leak, where the number of file descriptors in Spark Streaming keeps increasing. We end up with a "too many file open" error eventually, through an exception caused in JavaRDDKafkaWriter, which is writing a Spark JavaDStream. The exception is attached inline. Any help will be

Re: Jars directory in Spark 2.0

2017-01-31 Thread Koert Kuipers
you basically have to keep your versions of dependencies in line with sparks or shade your own dependencies. you cannot just replace the jars in sparks jars folder. if you want to update them you have to build spark yourself with updated dependencies and confirm it compiles, passes tests etc. On

Unique Partition Id per partition

2017-01-31 Thread Chawla,Sumit
Hi All, I have an RDD which I partition based on some key, and I can then call sc.runJob for each partition. Inside this function, I assign each partition a unique key using the following: "%s_%s" % (id(part), int(round(time.time()))) This is to make sure that each partition produces separate bookkeeping
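The question is PySpark, but for reference, a sketch in Scala of letting Spark supply a stable per-partition id (mapPartitionsWithIndex exists in PySpark too) instead of deriving one from object identity and wall-clock time; rdd is the assumed input:

    // partitionIndex is unique within the RDD and stable across retries of the same partition
    val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(record => (partitionIndex, record))
    }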

Multiple quantile calculations

2017-01-31 Thread Aaron Perrin
I want to calculate quantiles on two different columns. I know that I can calculate them with two separate operations. However, for performance reasons, I'd like to calculate both with one operation. Is this possible to do this with the Dataset API? I'm assuming that it isn't. But, if it isn't,
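One possible way to get both quantiles in a single pass (a sketch; column names are examples) is to put two approximate quantiles in one aggregation via the SQL percentile_approx function:

    import org.apache.spark.sql.functions.expr

    val medians = df.agg(
      expr("percentile_approx(col_a, 0.5)").as("median_a"),
      expr("percentile_approx(col_b, 0.5)").as("median_b")
    ).head()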

Roadblock -- stuck for 10 days :( how come same hive udf giving different results in spark and hive

2017-01-31 Thread Alex
Hi All, I am trying to run a Hive UDF in spark-sql and it is giving different rows as results in Hive and Spark. My UDF query looks something like this: select col1, col2, col3, sum(col4) col4, sum(col5) col5, Group_name from (select inline(myudf('cons1', record)) from table1) test group by

Error when loading saved ml model on pyspark (2.0.1)

2017-01-31 Thread Matheus Braun Magrin
Hi there, I've posted this question on StackOverflow as well but got no answers; maybe you guys can help me out. I'm building a Random Forest model using Spark and I want to save it to use again later. I'm running this on pyspark (Spark 2.0.1) without HDFS, so the files are saved to the local

Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
If you want to predict using a dataset, then transform is the way to go. If you want to predict on vectors, then you will have to wait for this issue to be completed: https://issues.apache.org/jira/browse/SPARK-10413 On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau wrote: > You

Re: ML version of Kmeans

2017-01-31 Thread Holden Karau
You most likely want the transform function on KMeansModel (although that works on a dataset input rather than a single element at a time). On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I am not able to find predict method on "ML" version of
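A sketch of that, assuming a fitted ml KMeansModel and a Dataset with a "features" column (names are examples):

    // transform adds a "prediction" column with the assigned cluster index
    val predictions = kmeansModel.transform(newData)
    predictions.select("features", "prediction").show()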

ML version of Kmeans

2017-01-31 Thread Madabhattula Rajesh Kumar
Hi, I am not able to find a predict method on the "ML" version of KMeans. The MLlib version has a predict method: KMeansModel.predict(point: Vector). How do I predict the cluster for new vectors in the ML version of KMeans? Regards, Rajesh

Jars directory in Spark 2.0

2017-01-31 Thread Sidney Feiner
Hey, While migrating to Spark 2.X from 1.6, I've had many issues with jars that come preloaded with Spark in the "jars/" directory, and I had to shade most of my packages. Can I replace the jars in this folder with more up-to-date versions? Are those jars used for anything internal in Spark which