parsing embedded json in spark

2016-12-21 Thread Tal Grynbaum
Hi, I have a dataframe that contains an embedded JSON string in one of its fields. I tried to write a UDF that parses it using lift-json, but it seems to take a very long time to process, and it seems that only the master node is working. Has anyone dealt with such a scenario before?
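
One built-in alternative to a hand-rolled lift-json UDF is Spark SQL's JSON path extraction, which runs on the executors like any other column expression. A minimal sketch, assuming a string column named payload (the column and field names here are hypothetical):

    import org.apache.spark.sql.functions.get_json_object

    // get_json_object (available since Spark 1.6) extracts a field by JSON path
    // and is evaluated on the executors, not on the driver.
    val parsed = df.withColumn("userId", get_json_object(df("payload"), "$.user.id"))

If only the master appears busy, it is also worth checking that the UDF is applied to a distributed DataFrame rather than to data already collected to the driver.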

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Divya Gehlot
Hi Mich, Can you try placing these jars in the Spark classpath? It should work. Thanks, Divya On 22 December 2016 at 05:40, Mich Talebzadeh wrote: > This works with Spark 2 with the Oracle jar file added to > $SPARK_HOME/conf/spark-defaults.conf

submit spark task on yarn asynchronously via java?

2016-12-21 Thread Linyuxin
Hi All, Version: Spark 1.5.1, Hadoop 2.7.2. Is there any way to submit and monitor a Spark task on YARN via Java asynchronously?
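
A minimal sketch of asynchronous submission with SparkLauncher, which ships with Spark since 1.4 (the richer SparkAppHandle/startApplication API arrived in later versions, so on 1.5.1 launch() returning a plain java.lang.Process is the available route; paths and class names below are hypothetical, and spark-submit must be resolvable via SPARK_HOME):

    import org.apache.spark.launcher.SparkLauncher

    // launch() forks a spark-submit process and returns immediately,
    // so the caller is free to do other work while the job runs.
    val process = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyJob")
      .setMaster("yarn-cluster")
      .launch()

    // Monitor asynchronously, e.g. from another thread:
    val exitCode = process.waitFor()

For richer monitoring on YARN one can also poll the ResourceManager REST API with the application id.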

Access HiveConf from SparkSession

2016-12-21 Thread Vishak Baby
In Spark 1.6.2, it was possible to access the HiveConf object via the method below. https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/hive/HiveContext.html#hiveconf() Can anyone let me know how to do the same in Spark 2.0.2, from the SparkSession object?
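
SparkSession does not expose a public hiveconf() accessor. A hedged sketch of a common workaround — reading and setting Hive properties through the session's runtime conf and SQL SET commands — though whether this covers every HiveConf use case depends on the property:

    // RuntimeConfig only sees properties Spark knows about, so getOption
    // may return None even for properties present in hive-site.xml.
    val uris = spark.conf.getOption("hive.metastore.uris")

    // SET goes through the session state and reaches Hive-side settings.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")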

In PySpark ML, how can I interpret the SparseVector returned by a pyspark.ml.classification.RandomForestClassificationModel.featureImportances ?

2016-12-21 Thread Russell Jurney
I am debugging problems with a PySpark RandomForestClassificationModel, and I am trying to use the feature importances to do so. However, the featureImportances property returns a SparseVector that isn't possible to interpret on its own. How can I transform the SparseVector into a useful list of features?
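
The vector is indexed by feature position, so it can be zipped back to the names that went into the assembler. A sketch in Scala for illustration (the same zip works in PySpark); assembler and model are hypothetical names for an existing VectorAssembler and fitted model:

    // featureImportances is sized to the feature count; toArray densifies the
    // SparseVector so every feature gets a (name, importance) pair.
    val featureNames = assembler.getInputCols
    featureNames.zip(model.featureImportances.toArray)
      .sortBy { case (_, imp) => -imp }
      .foreach { case (name, imp) => println(f"$name%-24s $imp%.4f") }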

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Mich Talebzadeh
Thanks Ayan, do you mean "driver" -> "oracle.jdbc.OracleDriver"? We added that one but it did not work! Dr Mich Talebzadeh

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread ayan guha
Try providing the correct driver name through the property variable in the jdbc call. On Thu., 22 Dec. 2016 at 8:40 am, Mich Talebzadeh wrote: > This works with Spark 2 with the Oracle jar file added to > $SPARK_HOME/conf/spark-defaults.conf
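
A sketch of that suggestion as a read call (URL, table, and credentials are placeholders):

    // Passing "driver" explicitly stops Spark from guessing the JDBC driver
    // class from the URL, which is a common failure mode with Oracle.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable", "SCHEMA.TABLE")
      .option("user", "scott")
      .option("password", "tiger")
      .load()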

Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Mich Talebzadeh
This works with Spark 2 with the Oracle jar file added to $SPARK_HOME/conf/spark-defaults.conf:

    spark.driver.extraClassPath   /home/hduser/jars/ojdbc6.jar
    spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar

and you get: scala> val s = HiveContext.read.format("jdbc").options(

Re: SPARK -SQL Understanding BroadcastNestedLoopJoin and number of partitions

2016-12-21 Thread David Hodeffi
Does anyone know whom I can talk to about this code? I am really curious to know why there is a join, and why the number of partitions for the join is the sum of both of them; I expected the number of partitions to be the same as the streamed table, or in the worst case multiplied.

spark-shell fails to redefine values

2016-12-21 Thread Yang
Summary: spark-shell fails to redefine values in some cases. This is at least found in a case where "implicit" is involved, but not limited to such cases. Run the following in spark-shell; you can see that the last redefinition does not take effect. The same code runs in the plain Scala REPL without

spark linear regression error training dataset is empty

2016-12-21 Thread Xiaomeng Wan
Hi, I am running linear regression on a DataFrame and get the following error: Exception in thread "main" java.lang.AssertionError: assertion failed: Training dataset is empty. at scala.Predef$.assert(Predef.scala:170) at

Re: Parquet with group by queries

2016-12-21 Thread Anil Langote
I tried caching the parent dataset, but it slows down the execution time. The last column in the input dataset is a double array, and the requirement is to add up these double arrays after doing the group by. I have implemented an aggregation function which adds the last column. Hence the query is Select
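
For reference, a minimal sketch of element-wise array aggregation by key, using the RDD API rather than Anil's (unshown) custom aggregation function; the column names are hypothetical:

    // Sums the double-array column element-wise per key; assumes all arrays
    // within a group have the same length.
    val summed = df.rdd
      .map(r => (r.getString(0), r.getAs[Seq[Double]]("values").toArray))
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })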

Parquet with group by queries

2016-12-21 Thread Anil Langote
Hi All, I have a requirement where I have to run 100 group-by queries with different columns. I have generated the parquet file, which has 30 columns. I see every parquet file has a different size, and 200 files are generated. My question is: what is the best approach to run group-by queries on

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
Incremental load traditionally means generating HFiles and using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data into HBase. For your use case, the producer needs to find rows where the flag is 0 or 1. After such rows are obtained, it is up to you how the result of

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Vadim Semenov
Check the source code for SparkLauncher: https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java#L541 A separate process will be started using `spark-submit`, and if it uses `yarn-cluster` mode, a driver may be launched on another NodeManager

Re: Spark kryo serialization register Datatype[]

2016-12-21 Thread Georg Heiler
I already set .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") to enable Kryo and .set("spark.kryo.registrationRequired", "true") to force Kryo. Strangely, I still see the issue of this missing DataType[]. Trying to register regular classes like Date

Re: NoClassDefFoundError

2016-12-21 Thread Vadim Semenov
You'd better ask folks in the spark-jobserver gitter channel: https://github.com/spark-jobserver/spark-jobserver On Wed, Dec 21, 2016 at 8:02 AM, Reza zade wrote: > Hello > > I've extended the JavaSparkJob (job-server-0.6.2) and created an object > of the SQLContext class. my

Re: Spark kryo serialization register Datatype[]

2016-12-21 Thread Vadim Semenov
To enable the Kryo serializer you just need to pass `spark.serializer=org.apache.spark.serializer.KryoSerializer`. The `spark.kryo.registrationRequired` setting controls the following behavior: > Whether to require registration with Kryo. If set to 'true', Kryo will throw an exception if an unregistered

Re: ML PIC

2016-12-21 Thread Robert Hamilton
Thank you Nick, that is good to know. Would this have some opportunity for newbs (like me) to volunteer some time? > On Dec 21, 2016, at 9:08 AM, Nick Pentreath wrote: > > It is part of the general feature parity roadmap. I can't recall offhand any

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
Ok, sure, will ask. But what would be a generic best-practice solution for incremental load from HBase? On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote: > I haven't used Gobblin. > You can consider asking the Gobblin mailing list about the first option. > > The second option would

Re: ML PIC

2016-12-21 Thread Yanbo Liang
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the progress. On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath wrote: > It is part of the general feature parity roadmap. I can't recall offhand > any blocker reasons it's just resources > On Wed, 21

streaming performance

2016-12-21 Thread Mendelson, Assaf
I am having trouble with streaming performance. My main problem is how to do a sliding-window calculation where the ratio between the window size and the step size is relatively large (hundreds) without recalculating everything all the time. I created a simple example of what I am aiming at with
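
For DStreams, the classic answer to this is reduceByKeyAndWindow with an inverse function, which on each slide only folds in the data entering the window and subtracts the data leaving it, instead of recomputing the whole window. A hedged sketch (events is a hypothetical DStream of keys, and checkpointing must be enabled for the inverse-function variant):

    import org.apache.spark.streaming.Seconds

    val counts = events.map(e => (e, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,   // fold in the slide entering the window
        (a: Long, b: Long) => a - b,   // subtract the slide leaving the window
        Seconds(600),                  // large window
        Seconds(2))                    // small step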

Re: [ On the use of Spark as 'storage system']

2016-12-21 Thread Sean Owen
Spark isn't a storage system -- it's a batch processing system at heart. To "serve" something means to run a distributed computation scanning partitions for an element and collect it to a driver and return it. Although that could be fast-enough for some definition of fast, it's going to be orders

[ On the use of Spark as 'storage system']

2016-12-21 Thread Enrico DUrso
Hello, I had a discussion today with a colleague who was saying the following: "We can use Spark as a fast serving layer in our architecture; that is, we can compute an RDD or even a dataset using Spark SQL, then we can cache it and offer the front-end layer access to our application in

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin. You can consider asking the Gobblin mailing list about the first option. The second option would work. On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri wrote: > Hello Guys, > > I would like to understand different approaches for Distributed

Re: ML PIC

2016-12-21 Thread Nick Pentreath
It is part of the general feature parity roadmap. I can't recall offhand any blocker reasons; it's just resources. On Wed, 21 Dec 2016 at 17:05, Robert Hamilton wrote: > Hi all. Is it on the roadmap to have a > spark.ml.clustering.PowerIterationClustering? Are there

ML PIC

2016-12-21 Thread Robert Hamilton
Hi all. Is it on the roadmap to have a spark.ml.clustering.PowerIterationClustering? Are there technical reasons that there is currently only an .mllib version?

Spark kryo serialization register Datatype[]

2016-12-21 Thread geoHeil
To force Spark to use Kryo serialization I set spark.kryo.registrationRequired to true. Now Spark complains that: Class is not registered: org.apache.spark.sql.types.DataType[]. How can I fix this? So far I could not successfully register this class.
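
Array classes have JVM names like "[Lorg.apache.spark.sql.types.DataType;", so one way to register them is via Class.forName. A minimal sketch under that assumption:

    import org.apache.spark.SparkConf

    // registerKryoClasses accepts Array[Class[_]]. Class.forName on the JVM
    // array name is one way to obtain the array class;
    // classOf[Array[org.apache.spark.sql.types.DataType]] would be another.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        Class.forName("[Lorg.apache.spark.sql.types.DataType;")
      ))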

Writing into parquet throws Array out of bounds exception

2016-12-21 Thread Selvam Raman
Hi, When I am trying to write a dataset to parquet or to show(1, false), my job throws an ArrayIndexOutOfBoundsException. 16/12/21 12:38:50 WARN TaskSetManager: Lost task 7.0 in stage 36.0 (TID 81, ip-10-95-36-69.dev): java.lang.ArrayIndexOutOfBoundsException: 63 at

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Thanks Liang! I get your point. It would mean that when launching Spark jobs, the mode needs to be specified as client for all Spark jobs. However, my concern is to know whether the driver's memory (which is launching the Spark jobs) will be used completely by the Futures (SparkContexts) or these spawned

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Sebastian, Yes, for fetching the details from Hive and HBase, I would want to use Spark's HiveContext etc. However, based on your point, I might have to check if a JDBC-based driver connection could be used to do the same. The main reason for this is to avoid a client-server architecture design.

NoClassDefFoundError

2016-12-21 Thread Reza zade
Hello, I've extended the JavaSparkJob (job-server-0.6.2) and created an object of the SQLContext class. My Maven project doesn't have any problem during the compile and packaging phases, but when I send the project's .jar to sjs and run it, a "NoClassDefFoundError" is issued. The trace of the exception is:

SPARK -SQL Understanding BroadcastNestedLoopJoin and number of partitions

2016-12-21 Thread David Hodeffi
I have two dataframes which I am joining: one small and one big. The optimizer suggests using BroadcastNestedLoopJoin. The number of partitions for the big DataFrame is 200, while the small DataFrame has 5 partitions. The joined dataframe results in 205 partitions.

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Sebastian Piu
Is there any reason you need a context in the application launching the jobs? You can use SparkLauncher in a normal app and just listen for state transitions. On Wed, 21 Dec 2016, 11:44 Naveen, wrote: > Hi Team, > > Thanks for your responses. > Let me give more details in
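
A sketch of that pattern using the SparkAppHandle listener API (available from Spark 1.6 onward; all paths and class names are hypothetical):

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // startApplication returns a handle immediately; the listener is called
    // back on submission state changes (SUBMITTED, RUNNING, FINISHED, ...).
    val handle = new SparkLauncher()
      .setAppResource("/path/to/child-job.jar")
      .setMainClass("com.example.ChildJob")
      .setMaster("yarn-cluster")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"child job state: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

No SparkContext is needed in the launching application itself.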

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Team, Thanks for your responses. Let me give more details, with a picture of how I am trying to launch jobs. The main Spark job will launch other Spark jobs, similar to calling multiple spark-submits within a Spark driver program. These spawned threads for new jobs will be totally different components,

Re: Gradle dependency problem with spark

2016-12-21 Thread kant kodali
@Sean perhaps I could leverage this once http://openjdk.java.net/jeps/261 becomes available. On Fri, Dec 16, 2016 at 4:05 AM, Steve Loughran wrote: > FWIW, although the underlying Hadoop declared guava dependency is pretty > low, everything in org.apache.hadoop is

Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
Hello Guys, I would like to understand different approaches for distributed incremental load from HBase. Is there any *tool / incubator tool* which satisfies the requirement? *Approach 1:* Write a Kafka producer, manually maintain a column flag for events, and ingest it with LinkedIn Gobblin to HDFS /

RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not familiar with any problem with that. Anyway, if you run a Spark application you would have multiple jobs anyway, so it makes sense that this is not a problem. Thanks, David. From: Naveen [mailto:hadoopst...@gmail.com] Sent: Wednesday, December 21, 2016 9:18 AM To: d...@spark.apache.org;