Re: Question on using pseudo columns in spark jdbc options

2017-12-07 Thread रविशंकर नायर
It works perfectly. You can use pseudo columns like ROWNUM in Oracle and RRN in DB2. To avoid skewing you can apply the great coalesce function... Spark is sparkling. Best, On Thu, Dec 7, 2017 at 2:20 PM, Tomasz Dudek wrote: > Hey Ravion, > > yes, you can
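A rough Scala sketch of the idea being discussed, reading an Oracle table partitioned on the ROWNUM pseudo column and then coalescing to even out skew; the URL, credentials, table name, bounds, and partition counts are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pseudo-column").getOrCreate()

// Hypothetical connection details; ROWNUM is an Oracle pseudo column,
// so every row gets a value even when the table has no numeric key.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "(SELECT ROWNUM AS RNO, t.* FROM my_table t) sub")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "RNO")   // column Spark uses to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")    // rough row count of the table
  .option("numPartitions", "16")
  .load()

// If some partitions come back nearly empty (skew), collapse them.
val evened = df.coalesce(8)
```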

Re: Row Encoder For DataSet

2017-12-07 Thread Georg Heiler
You are looking for a UDAF. Sandip Mehta wrote on Fri., Dec 8, 2017 at 06:20: > Hi, > > I want to group on certain columns and then for every group apply a > custom UDF to it. Currently groupBy only allows adding aggregation > functions to GroupedData.
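For reference, a minimal Scala UDAF sketch in the Spark 2.x UserDefinedAggregateFunction style; the "sum a value per group" logic and the column names are only placeholders:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Toy aggregate: sums a numeric column for each group.
class SumValue extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Long = buffer.getLong(0)
}

// Usage sketch: df.groupBy($"x", $"y").agg(new SumValue()(df("value")))
```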

Re: Row Encoder For DataSet

2017-12-07 Thread Sandip Mehta
Hi, I want to group on certain columns and then, for every group, apply a custom UDF to it. Currently groupBy only allows adding aggregation functions to GroupedData. For this I was thinking of using groupByKey, which will return a KeyValueGroupedDataset, and then applying the UDF to every group, but
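A minimal Scala sketch of that groupByKey route, running arbitrary per-group logic in mapGroups; the Record case class and the per-group computation are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

case class Record(x: Int, y: Int, value: Long)

val spark = SparkSession.builder().appName("group-by-key").getOrCreate()
import spark.implicits._

val ds = Seq(Record(1, 1, 10L), Record(1, 1, 5L), Record(2, 1, 7L)).toDS()

// Group by the (x, y) pair; mapGroups gets the key plus an iterator over the group.
val perGroup = ds
  .groupByKey(r => (r.x, r.y))
  .mapGroups { (key, rows) =>
    val total = rows.map(_.value).sum   // any custom computation over the group
    (key._1, key._2, total)
  }

perGroup.show()
```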

[Spark SQL]: Dataset can not map into Dataset in java

2017-12-07 Thread Himasha de Silva
Hi, I'm trying to map a Dataset that I read from CSV files into another Dataset, but it gives some errors. Can anyone please help me figure it out? Dataset<Row> t_en_data = session.read().option("header","true") .option("inferSchema","true") .csv("J:\\csv_path\\T_EN"); Dataset
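Without the error log it is hard to be specific, but one common cause is a missing Encoder: in the Java API, Dataset.map takes an Encoder as its second argument (for example Encoders.bean(...)). A rough Scala sketch of the equivalent flow, with a made-up case class and path standing in for the real ones:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Entry(id: Int, name: String)   // stand-in for the real target type

val spark = SparkSession.builder().appName("csv-to-typed-ds").getOrCreate()
import spark.implicits._

val raw: Dataset[Row] = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/csv")                    // placeholder path

// map needs an Encoder for Entry: in Scala it comes from spark.implicits._,
// in Java you would pass e.g. Encoders.bean(Entry.class) explicitly.
val typed: Dataset[Entry] =
  raw.map(r => Entry(r.getAs[Int]("id"), r.getAs[String]("name")))
```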

Re: Row Encoder For DataSet

2017-12-07 Thread Weichen Xu
You can groupBy multiple columns on a DataFrame, so why do you need such a complicated schema? Suppose the df schema is (x, y, u, v, z): df.groupBy($"x", $"y").agg(...) Is this what you want? On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta wrote: > Hi, > > During my aggregation I end
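Spelled out as a small runnable Scala sketch; the column names follow the (x, y, u, v, z) schema above and the aggregates are arbitrary examples:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

val spark = SparkSession.builder().appName("groupby-agg").getOrCreate()
import spark.implicits._

val df = Seq(
  (10, 11, 1.0, 2.0, 3L),
  (10, 11, 2.0, 4.0, 5L),
  (20, 11, 3.0, 6.0, 7L)
).toDF("x", "y", "u", "v", "z")

// Group on several columns at once and aggregate the rest.
df.groupBy($"x", $"y")
  .agg(sum($"z").as("z_sum"), avg($"u").as("u_avg"))
  .show()
```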

[Spark SQL]: Dataset can not map into Dataset in java

2017-12-07 Thread Himasha de Silva
Hi, I'm new to Spark. I'm trying to map a Dataset that I read from CSV files into another Dataset, but it gives some errors. Can anyone please help me figure it out? My code, CSV file and error log are attached here. Thank you. -- Himasha De Silva Undergraduate, Department of Computer Engineering,

RDD[internalRow] -> DataSet

2017-12-07 Thread satyajit vegesna
Hi All, Is there a way to convert an RDD[InternalRow] to a Dataset from outside the spark sql package? Regards, Satyajit.
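Most of the machinery for this is private[sql], so there is no supported one-liner from the outside. One workaround sketch in Scala (Spark 2.x, assuming the schema of the rows is known) is to deserialize each InternalRow back to a Row with RowEncoder and then call createDataFrame; note this leans on catalyst classes that are not a stable public API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Assumption: `internalRows` and its `schema` are already available to you.
def toDataFrame(spark: SparkSession, internalRows: RDD[InternalRow], schema: StructType) = {
  val rows: RDD[Row] = internalRows.mapPartitions { iter =>
    // RowEncoder/fromRow are catalyst internals (Spark 2.x), not a stable API.
    val deserializer = RowEncoder(schema).resolveAndBind()
    iter.map(deserializer.fromRow)
  }
  spark.createDataFrame(rows, schema)
}
```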

Row Encoder For DataSet

2017-12-07 Thread Sandip Mehta
Hi, During my aggregation I end up having the following schema: Row(Row(val1,val2), Row(val1,val2,val3...)) val values = Seq( (Row(10, 11), Row(10, 2, 11)), (Row(10, 11), Row(10, 2, 11)), (Row(20, 11), Row(10, 2, 11)) ) The 1st tuple is used to group the relevant records for
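One way to build such a nested structure, as a Scala sketch with guessed field names and types (createDataFrame with an explicit StructType; a RowEncoder over the same schema would work as well):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("nested-rows").getOrCreate()

// Guessed names/types for the two nested structs.
val keySchema = StructType(Seq(
  StructField("val1", IntegerType), StructField("val2", IntegerType)))
val valSchema = StructType(Seq(
  StructField("val1", IntegerType), StructField("val2", IntegerType), StructField("val3", IntegerType)))
val schema = StructType(Seq(StructField("key", keySchema), StructField("value", valSchema)))

val values = Seq(
  Row(Row(10, 11), Row(10, 2, 11)),
  Row(Row(10, 11), Row(10, 2, 11)),
  Row(Row(20, 11), Row(10, 2, 11)))

val df = spark.createDataFrame(spark.sparkContext.parallelize(values), schema)

// Grouping on the whole nested key struct works as a single column.
df.groupBy("key").count().show()
```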

Re: Spark job only starts tasks on a single node

2017-12-07 Thread Ji Yan
This used to work. The only thing that has changed is that the Mesos installed on the Spark executor is a different version from before. My Spark executor runs in a container whose image has Mesos installed. The version of that Mesos is actually different from the version of the Mesos master. Not

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Qiao, Richard
For your example question, the answer is yes. “For example, if an application wanted 4 executors (spark.executor.instances=4) but the spark cluster can only provide 1 executor. This means that I will only receive 1 onExecutorAdded event. Will the application state change to
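A minimal Scala sketch of tracking executor registrations with a SparkListener; how you map the count onto an application-level WAITING/RUNNING notion is left to your own code:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Counts live executors; your own logic decides when "enough" means RUNNING.
class ExecutorCountListener extends SparkListener {
  val executors = new AtomicInteger(0)
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    println(s"executors now: ${executors.incrementAndGet()}")
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    println(s"executors now: ${executors.decrementAndGet()}")
}

def register(sc: SparkContext): Unit = sc.addSparkListener(new ExecutorCountListener)
```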

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Qiao, Richard
For #2, do you mean “RUNNING” showing in the “Driver” table? If yes, that is not a problem, because the driver does run while there is no executor available, and that can be a status for you to catch – driver running while no executors. Comparing #1 and #3, my understanding of “submitted” is “the jar is

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Marcelo Vanzin
That's the Spark Master's view of the application. I don't know exactly what it means in the different run modes, I'm more familiar with YARN. But I wouldn't be surprised if, as with others, it mostly tracks the driver's state. On Thu, Dec 7, 2017 at 12:06 PM, bsikander

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander
See the image. I am referring to this state when I say "Application State". -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: sparkSession.sql("sql query") vs df.sqlContext().sql(this.query) ?

2017-12-07 Thread khathiravan raj maadhaven
Hi Kant, Based on my understanding, I think the only difference is the overhead of selecting/creating the SQLContext for the query you have passed. As the table/view is already available for use, sparkSession.sql('your query') should be simple & good enough. The following uses the
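A small Scala sketch illustrating that both entry points resolve against the same session state and catalog (the SQLContext obtained from a DataFrame just wraps its SparkSession); the temp view name is made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-entrypoints").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.createOrReplaceTempView("people")

// Both run against the same catalog, so they return the same result.
spark.sql("SELECT count(*) FROM people").show()
df.sqlContext.sql("SELECT count(*) FROM people").show()
```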

Best way of shipping self-contained pyspark jobs with 3rd-party dependencies

2017-12-07 Thread Sergey Zhemzhitsky
Hi PySparkers, What currently is the best way of shipping self-contained pyspark jobs with 3rd-party dependencies? There are some open JIRA issues [1], [2] as well as corresponding PRs [3], [4] and articles [5], [6], regarding setting up the python environment with conda and virtualenv

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Marcelo Vanzin
On Thu, Dec 7, 2017 at 11:40 AM, bsikander wrote: > For example, if an application wanted 4 executors > (spark.executor.instances=4) but the spark cluster can only provide 1 > executor. This means that I will only receive 1 onExecutorAdded event. Will > the application state

Re: Spark job only starts tasks on a single node

2017-12-07 Thread Art Rand
Sounds a little like the driver got one offer when it was using zero resources, then it's not getting any more. How many frameworks (and which) are running on the cluster? The Mesos Master log should say which frameworks are getting offers, and should help diagnose the problem. A On Thu, Dec 7,

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander
Marcelo Vanzin wrote > I'm not sure I follow you here. This is something that you are > defining, not Spark. Yes, you are right. In my code, 1) my notion of RUNNING is that both driver + executors are in RUNNING state. 2) my notion of WAITING is if any one of driver/executor is in WAITING state.

Re: Question on using pseudo columns in spark jdbc options

2017-12-07 Thread Tomasz Dudek
Hey Ravion, yes, you can obviously specify a column other than the primary key. Be aware, though, that if the key range is not spread evenly (for example in your code, if there's a "gap" in the primary keys and no row has an id between 0 and 17220) some of the executors may not assist in loading data

Re: Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread Pierce Lamb
Hi Umar, While this answer is a bit dated, you may find it useful in diagnosing a store for Spark SQL tables: https://stackoverflow.com/a/39753976/3723346 I don't know much about Pentaho or Arcadia, but I assume many of the listed options have a JDBC or ODBC client. Hope this helps, Pierce

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread Qiao, Richard
Kant, right, we cannot use the Driver’s producer in an executor. That’s why I mentioned “kafka sink” to solve it. This article should be helpful about it: https://allegro.tech/2015/08/spark-kafka-integration.html Best Regards Richard From: kant kodali Date: Thursday, December 7, 2017
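The pattern from that article, as a hedged Scala sketch: wrap the producer so that only a small factory function is serialized to the executors and the actual KafkaProducer is created lazily there (broker list and serializers are placeholders):

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Serializable wrapper: only the factory function travels to the executors;
// the KafkaProducer itself is created lazily, once per executor JVM.
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(brokers: String): KafkaSink = new KafkaSink(() => {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)   // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    sys.addShutdownHook(producer.close())     // close when the executor JVM exits
    producer
  })
}
```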

Re: Spark job only starts tasks on a single node

2017-12-07 Thread Susan X. Huynh
Sounds strange. Maybe it has to do with the job itself? What kind of job is it? Have you gotten it to run on more than one node before? What's in the spark-submit command? Susan On Wed, Dec 6, 2017 at 11:21 AM, Ji Yan wrote: > I am sure that the other agents have plentiful

Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread umargeek
Hi All, We are currently looking for real-time streaming analytics of data stored as Spark SQL tables. Is there any external connectivity available to connect with BI tools (Pentaho/Arcadia)? Currently, we are storing data in Hive tables, but the response on the Arcadia dashboard is slow.

Re: How to write dataframe to kafka topic in spark streaming application using pyspark other than collect?

2017-12-07 Thread umargeek
Hi Team, Can someone please advise me on the above post? Because of this, I have written the data file to an HDFS location, so as of now I am just passing the filename into the Kafka topic and not utilizing Kafka's potential at its best. Looking forward to suggestions. Thanks, Umar -- Sent from:
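If the frames can be written with the Kafka data source (spark-sql-kafka-0-10, available since Spark 2.2), nothing has to be collected to the driver. A hedged Scala sketch of that route (the same options apply from PySpark); the broker and topic names are placeholders, and the frame only needs a string or binary "value" column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().appName("df-to-kafka").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// The Kafka sink expects a column named "value" (and optionally "key").
df.select(to_json(struct(df.columns.map(col): _*)).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("topic", "my_topic")                          // placeholder
  .save()
```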

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread kant kodali
Hi Richard, I have tried your sample code now, and several times in the past as well. The problem seems to be that kafkaProducer is not serializable, so I get a "Task not serializable" exception, and my kafkaProducer object is created using the following jar: group: 'org.apache.kafka', name:
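For completeness, a sketch of how such a wrapper is typically used from a DStream, assuming the KafkaSink wrapper sketched earlier in this thread; the stream, broker, and topic names are placeholders:

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Assumes `ssc: StreamingContext`, a `stream: DStream[String]`, and the KafkaSink wrapper above.
def publish(ssc: StreamingContext, stream: DStream[String]): Unit = {
  // Broadcast the (serializable) sink once; the producer itself is built lazily on each executor.
  val sink = ssc.sparkContext.broadcast(KafkaSink("broker1:9092"))
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(record => sink.value.send("my_topic", record))
    }
  }
}
```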

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread Qiao, Richard
Thanks for sharing the code. The 1st problem in the first code is that the map is allocated in the Driver, but it’s trying to put data in the Executors and then retrieve it in the driver to send to Kafka. You are using this map as if it were an accumulator, but it doesn’t work that way. The 2nd problem is both

Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs themselves reference. I can confirm that perplexity increases and likelihood decreases as topics increase - and am similarly confused by these results. 2017-09-28 10:50 GMT-07:00 Cody Buntain : > Hi,
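For reference, a hedged Scala sketch of the kind of sweep being described, using Spark ML's LDA; the features input (e.g. from CountVectorizer) and the candidate k values are placeholders. Lower logPerplexity / higher logLikelihood on held-out data is usually what one looks for:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.DataFrame

// Assumes `corpus` has a "features" vector column, split for held-out evaluation.
def sweepTopics(corpus: DataFrame): Unit = {
  val Array(train, test) = corpus.randomSplit(Array(0.8, 0.2), seed = 42L)
  for (k <- Seq(5, 10, 20, 40)) {
    val model = new LDA().setK(k).setMaxIter(50).fit(train)
    println(s"k=$k  logLikelihood=${model.logLikelihood(test)}  logPerplexity=${model.logPerplexity(test)}")
  }
}
```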