Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Yes, that's it! Thank you, Ayan. On Tue, Sep 13, 2016 at 5:50 PM, ayan guha wrote:
> >>> df.show()
> +----+---+---+-----+-----+-----+
> |city|flg| id|  nbr|price|state|
> +----+---+---+-----+-----+-----+
> |  CA|  0|  1| 1000|  100|    A|
> |  CA|  1|  2| 1010|   96|

Re: Using Spark SQL to Create JDBC Tables

2016-09-13 Thread ayan guha
I did not install it myself, as it is part of Oracle's product. However, you can bring in any SerDe yourself and add it to the library. See this blog for more information. On Wed, Sep 14, 2016 at 2:15 PM, Benjamin Kim

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Sure, and please post back if it works (or it does not :) ) On Wed, Sep 14, 2016 at 2:09 PM, Saliya Ekanayake wrote: > Thank you, I'll try. > > saliya > > On Wed, Sep 14, 2016 at 12:07 AM, ayan guha wrote: > >> Depends on join, but unless you are doing

Re: Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim
Thank you for the idea. I will look for a PostgreSQL Serde for Hive. But, if you don’t mind me asking, how did you install the Oracle Serde? Cheers, Ben > On Sep 13, 2016, at 7:12 PM, ayan guha wrote: > > One option is have Hive as the central point of exposing data ie

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, I'll try. saliya On Wed, Sep 14, 2016 at 12:07 AM, ayan guha wrote: > Depends on join, but unless you are doing cross join, it should not blow > up. 6M is not too much. I think what you may want to consider (a) volume of > your data files (b) reduce shuffling by

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Depends on join, but unless you are doing cross join, it should not blow up. 6M is not too much. I think what you may want to consider (a) volume of your data files (b) reduce shuffling by following similar partitioning on both RDDs On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake

Re: Can I assign affinity for spark executor processes?

2016-09-13 Thread Jakob Odersky
Hi Xiaoye, could it be that the executors were spawned before the affinity was set on the worker? Would it help to start spark worker with taskset from the beginning, i.e. "taskset [mask] start-slave.sh"? Workers in spark (standalone mode) simply create processes with the standard java process

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, but isn't that join going to be too expensive for this? On Tue, Sep 13, 2016 at 11:55 PM, ayan guha wrote: > My suggestion: > > 1. Read first text file in (say) RDD1 using textFile > 2. Read 80K data files in RDD2 using wholeTextFile. RDD2 will be of > signature

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
My suggestion:
1. Read the first text file into (say) RDD1 using textFile.
2. Read the 80K data files into RDD2 using wholeTextFiles; RDD2 will be of signature (filename, filecontent).
3. Join RDD1 and RDD2 on the file name (or some other key). A rough sketch follows below.
On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake
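A minimal sketch of that suggestion in Scala (the paths, join key, and app name are illustrative assumptions, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("join-80k-files"))

// RDD1: each line of the driver file names one of the ~80K data files
val rdd1 = sc.textFile("hdfs:///input/records.txt").map(line => (line.trim, line))

// RDD2: (fileName, fileContent) for every data file under one directory
val rdd2 = sc.wholeTextFiles("hdfs:///input/datafiles")
  .map { case (path, content) => (path.split("/").last, content) }

// Join on file name so each record line is paired with the content of its file
val joined = rdd1.join(rdd2)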

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
1.) What needs to be parallelized is the work for each of those 6M rows, not the 80K files. Let me elaborate this with a simple for loop, if we were to write this serially:

For each line L (out of 6M) in the first file {
  process the file corresponding to L out of those 80K files.
}

The 80K

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Questions: 1. Why can you not read all 80K files together? i.e., why do you have a dependency on the first text file? 2. Your first text file has 6M rows, but the total number of files is ~80K. Is there a scenario where there may not be a file in HDFS corresponding to a row in the first text file? 3. Maybe a follow

Can I assign affinity for spark executor processes?

2016-09-13 Thread Xiaoye Sun
Hi, In my experiment, I pin one very important process on a fixed CPU. So the performance of Spark task execution will be affected if the executors or the worker uses that CPU. I am wondering if it is possible to let the Spark executors not using a particular CPU. I tried to 'taskset -p

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
The first text file is not that large, it has 6 million records (lines). For each line I need to read a file out of those 80K files. They total around 1.5TB. I didn't understand what you meant by "then again read text files for each line and union all rdds." On Tue, Sep 13, 2016 at 10:04 PM,

Re: KafkaUtils.createDirectStream() with kafka topic expanded

2016-09-13 Thread Cody Koeninger
That version of createDirectStream doesn't handle partition changes. You can work around it by starting the job again. The spark 2.0 consumer for kafka 0.10 should handle partition changes via SubscribePattern. On Tue, Sep 13, 2016 at 7:13 PM, vinay gupta wrote: >
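Roughly, the Spark 2.0 / Kafka 0.10 integration with SubscribePattern looks like the sketch below (the broker address, group id, topic regex, and batch interval are placeholder assumptions):

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(new SparkConf().setAppName("pattern-stream"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group"
)

// SubscribePattern lets the consumer pick up topics/partitions that appear after the stream starts
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("mytopic.*"), kafkaParams))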

Re: Spark kafka integration issues

2016-09-13 Thread Cody Koeninger
1. see http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers and look for HasOffsetRanges. If you really want the info per-message rather than per-partition, createRDD has an overload that takes a messageHandler from MessageAndMetadata to
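For the per-partition case, the offset ranges can be read off each RDD along the lines below (a generic sketch; `stream` stands for whatever KafkaUtils.createDirectStream returned):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Each partition of the RDD maps 1:1 to a Kafka topic-partition and its offset range
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
  }
}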

Re: Using Spark SQL to Create JDBC Tables

2016-09-13 Thread ayan guha
One option is to have Hive as the central point of exposing data, i.e. create Hive tables which "point to" any other DB. I know Oracle provides their own SerDe for Hive. Not sure about PG though. Once tables are created in Hive, STS will automatically see them. On Wed, Sep 14, 2016 at 11:08 AM, Benjamin

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Raghavendra Pandey
How large is your first text file? The idea is that you read the first text file and, if it is not large, you can collect all the lines on the driver, then read the text files for each line and union all the RDDs. On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote: > Just wonder if this

Shuffle Spill (Memory) greater than Shuffle Spill (Disk)

2016-09-13 Thread prayag chandran
Hello! In my spark job, I see that Shuffle Spill (Memory) is greater than Shuffle Spill (Disk). spark.shuffle.compress parameter is left to default(true?). I would expect the size on disk to be smaller which isn't the case here. I've been having some performance issues as well and I suspect this

Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim
Has anyone created tables using Spark SQL that directly connect to a JDBC data source such as PostgreSQL? I would like to use Spark SQL Thriftserver to access and query remote PostgreSQL tables. In this way, we can centralize data access to Spark SQL tables along with PostgreSQL making it very
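For reference, one common way to do this is sketched below (a hedged example only; the PostgreSQL URL, table, and credentials are placeholders, and sqlContext is the usual Spark 1.6 entry point):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver")

// Load the remote PostgreSQL table over JDBC and expose it to Spark SQL / the Thriftserver
val pg = sqlContext.read.jdbc("jdbc:postgresql://pg-host:5432/mydb", "public.customers", props)
pg.registerTempTable("customers")   // Spark 1.6.x; use createOrReplaceTempView in 2.x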

Re: Spark SQL Thriftserver

2016-09-13 Thread Takeshi Yamamuro
Hi all, Spark STS just uses HiveContext inside and does not use MR. Anyway, Spark STS misses some HiveServer2 functionalities such as HA (See: https://issues.apache.org/jira/browse/SPARK-11100) and has some known issues there. So, you'd be better off checking all the jira issues related to STS for

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread ayan guha
>>> df.show()
+----+---+---+-----+-----+-----+
|city|flg| id|  nbr|price|state|
+----+---+---+-----+-----+-----+
|  CA|  0|  1| 1000|  100|    A|
|  CA|  1|  2| 1010|   96|    A|
|  CA|  1|  3| 1010|  195|    A|
|  NY|  0|  4| 2000|  124|    B|
|  NY|  1|  5| 2001|  128|    B|
|  NY|  0|
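For readers of the archive, a window-function approach along these lines is what the thread is pointing at (only a rough sketch; df is the DataFrame shown above, and pairing each row with its lag/lead neighbour by nbr within a city is illustrative, not necessarily the exact solution posted):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead}

// Order rows by nbr within each city and look at the immediately neighbouring rows
val w = Window.partitionBy("city").orderBy("nbr")

val withNeighbours = df
  .withColumn("prev_price", lag("price", 1).over(w))
  .withColumn("next_price", lead("price", 1).over(w))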

KafkaUtils.createDirectStream() with kafka topic expanded

2016-09-13 Thread vinay gupta
Hi we are using the following version of KafkaUtils.createDirectStream() from spark 1.5.0 createDirectStream(JavaStreamingContext jssc, Class keyClass, Class valueClass,

Re: Spark SQL Thriftserver

2016-09-13 Thread ayan guha
Hi AFAIK STS uses Spark SQL and not Map Reduce. Is that not correct? Best Ayan On Wed, Sep 14, 2016 at 8:51 AM, Mich Talebzadeh wrote: > STS will rely on Hive execution engine. My Hive uses Spark execution > engine so STS will pass the SQL to Hive and let it do the

Spark kafka integration issues

2016-09-13 Thread Mukesh Jha
Hello fellow sparkers, I'm using Spark to consume messages from Kafka in a non-streaming fashion. I'm using spark-streaming-kafka-0-8_2.10 with Spark 2.0 to do the same. I have a few queries on this; please get back if you have clues on any of them. 1) Is there any way to get the

Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
STS will rely on the Hive execution engine. My Hive uses the Spark execution engine, so STS will pass the SQL to Hive and let it do the work and return the result set, which beeline displays:

/usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p

Re: Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim
Mich, It sounds like that there would be no harm in changing then. Are you saying that using STS would still use MapReduce to run the SQL statements? What our users are doing in our CDH 5.7.2 installation is changing the execution engine to Spark when connected to HiveServer2 to get faster

Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
Hi, the Spark Thrift server (STS) still uses the Hive thrift server. If you look at $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2):

function usage {
  echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
  pattern="usage"
  pattern+="\|Spark assembly has been

Re: Spark Java Heap Error

2016-09-13 Thread Baktaawar
These are the settings I have:

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory

Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim
Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 1.6.2 instead of HiveServer2? We are considering abandoning HiveServer2 for it. Some advice and gotchas would be nice to know. Thanks, Ben

Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Jonathan Kelly
Yes, Spark on EMR runs on YARN, so there is only a Spark UI when a Spark app is running. To expand on what Natu says, the best way to view the Spark UI for both running and completed Spark apps is to start from the YARN ResourceManager UI (port 8088) and to click the "Application Master" link (for

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Hi Sean, Now let's assume we have column C and column D normalized, and the metric is simplified to abs(C1 - C2) + abs(D1 - D2). Can we get a performance benefit from LSH? Thank you again! Best, Rex On Tue, Sep 13, 2016 at 12:47 PM, Sean Owen wrote: > Given the nature

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
Given the nature of your metric, I don't think you can use things like LSH which more or less depend on a continuous metric space. This is too specific to fit into a general framework usefully I think, but, I think you can solve this directly with some code without much trouble. On Tue, Sep 13,

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-13 Thread Daniel Lopes
Hi Mario, Thanks for your help, so I will keep using CSVs. Best, Daniel Lopes Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Mon, Sep

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Hi Sean, Great! Is there any sample code implementing Locality Sensitive Hashing with Spark, in either scala or python? "However if your rule is really like "must match column A and B and then closest value in column C then just ordering everything by A, B, C lets you pretty much read off the

Re: Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Sean Owen
Based on your description, this isn't a problem in Spark. It means your JDBC connector isn't interpreting bytes from the database according to the encoding in which they were written. It could be Latin1, sure. But if "new String(ResultSet.getBytes())" works, it's only because your platform's
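A minimal sketch of reading the column bytes with an explicit charset instead of the platform default (the column index is an assumption; ISO-8859-1 is the charset commonly known as latin1):

import java.nio.charset.StandardCharsets
import java.sql.ResultSet

def readLatin1Column(rs: ResultSet, columnIndex: Int): String = {
  // Decode the stored bytes as latin1 explicitly rather than relying on the JVM default charset
  val bytes = rs.getBytes(columnIndex)
  if (bytes == null) null else new String(bytes, StandardCharsets.ISO_8859_1)
}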

Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Mark Bittmann
Hello Spark community, I'm reading from a MySQL database into a Spark dataframe using the JDBC connector functionality, and I'm experiencing some character encoding issues. The default encoding for MySQL strings is latin1, but the mysql JDBC connector implementation of "ResultSet.getString()"

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Mark Hamstra
It sounds like you should be writing an application and not trying to force the spark-shell to do more than what it was intended for. On Tue, Sep 13, 2016 at 11:53 AM, Kevin Burton wrote: > I sort of agree but the problem is that some of this should be code. > > Some of our

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I sort of agree but the problem is that some of this should be code. Some of our ES indexes have 100-200 columns. Defining which ones are arrays on the command line is going to get ugly fast. On Tue, Sep 13, 2016 at 11:50 AM, Sean Owen wrote: > You would generally use

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
You would generally use --conf to set this on the command line if using the shell. On Tue, Sep 13, 2016, 19:22 Kevin Burton wrote: > The problem is that without a new spark context, with a custom conf, > elasticsearch-hadoop is refusing to read in settings about the ES

Re: Check if a nested column exists in DataFrame

2016-09-13 Thread Arun Patel
Is there a way to check nested column exists from Schema in PySpark? http://stackoverflow.com/questions/37471346/automatically-and-elegantly-flatten-dataframe-in-spark-sql shows how to get the list of nested columns in Scala. But, can this be done in PySpark? Please help. On Mon, Sep 12, 2016

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
The problem is that without a new spark context, with a custom conf, elasticsearch-hadoop is refusing to read in settings about the ES setup... if I do a sc.stop() , then create a new one, it seems to work fine. But it isn't really documented anywhere and all the existing documentation is now
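The stop-and-recreate pattern described above looks roughly like this in the 2.0 shell (the es.* settings are illustrative examples of elasticsearch-hadoop configuration, not taken from the thread):

import org.apache.spark.{SparkConf, SparkContext}

sc.stop()   // stop the context the shell created

val conf = new SparkConf()
  .setAppName("es-shell")
  .set("es.nodes", "es-host:9200")                   // example setting
  .set("es.read.field.as.array.include", "tags")     // example setting

val sc2 = new SparkContext(conf)

As noted in the replies, passing these with --conf when launching the shell is the cleaner route.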

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Mich Talebzadeh
I think this works in a shell but you need to allow multiple spark contexts.

Spark context Web UI available at http://50.140.197.217:5
Spark context available as 'sc' (master = local, app id = local-1473789661846).
Spark session available as 'spark'.
Welcome to __

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi Mich, Even I'm getting similar output. The dates that are passed as input are different from the ones in the output. Since it's an inner join, the expected result is [2015-12-31,2015-12-31,1,105] [2016-01-27,2016-01-27,5,101] Thanks & Regds, --Praseetha On Tue, Sep 13, 2016 at 11:21 PM, Mich

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
The key is really to specify the distance metric that defines "closeness" for you. You have features that aren't on the same scale, and some that aren't continuous. You might look to clustering for ideas here, though mostly you just want to normalize the scale of dimensions to make them
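A small sketch of the normalization step Sean describes, min-max scaling the continuous columns so neither dominates the distance (df is the DataFrame from the question, column names follow the example table in the thread, and the scaling choice itself is arbitrary):

import org.apache.spark.sql.functions.{col, max, min}

val stats = df.agg(
  min(col("Price").cast("double")).as("pMin"), max(col("Price").cast("double")).as("pMax"),
  min(col("Number").cast("double")).as("nMin"), max(col("Number").cast("double")).as("nMax")
).first()

// Scale Price and Number to [0, 1] before computing |C1 - C2| + |D1 - D2|
val normalized = df
  .withColumn("priceNorm",
    (col("Price").cast("double") - stats.getAs[Double]("pMin")) /
      (stats.getAs[Double]("pMax") - stats.getAs[Double]("pMin")))
  .withColumn("numberNorm",
    (col("Number").cast("double") - stats.getAs[Double]("nMin")) /
      (stats.getAs[Double]("nMax") - stats.getAs[Double]("nMin")))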

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Just wonder if this is possible with Spark? On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake wrote: > Hi, > > I've got a text file where each line is a record. For each record, I need > to process a file in HDFS. > > So if I represent these records as an RDD and invoke a

Re: Spark_JDBC_Partitions

2016-09-13 Thread Suresh Thalamati
There is also another jdbc method in the DataFrameReader API to specify your own predicates for each partition. Using this you can control what is included in each partition.

val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
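The call above is truncated in the archive; the predicates overload looks roughly like this (the URL, table, and credentials are placeholders):

import java.util.Properties

// One WHERE clause per partition, so you control exactly what lands where
val predicates = Array("id < 100", "id >= 100 and id < 200")

val connProps = new Properties()
connProps.setProperty("user", "dbuser")
connProps.setProperty("password", "secret")

val df = spark.read.jdbc("jdbc:oracle:thin:@//db-host:1521/SVC", "MY_TABLE", predicates, connProps)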

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Marcelo Vanzin
You're running spark-shell. It already creates a SparkContext for you and makes it available in a variable called "sc". If you want to change the config of spark-shell's context, you need to use command line option. (Or stop the existing context first, although I'm not sure how well that will

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
But you're in the shell there, which already has a SparkContext for you as sc. On Tue, Sep 13, 2016 at 6:49 PM, Kevin Burton wrote: > I'm rather confused here as to what to do about creating a new > SparkContext. > > Spark 2.0 prevents it... (exception included below) > >

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Hi Praseetha, This is how I have written this:

case class TestDate (id: String, loginTime: java.sql.Date)
val formate = new SimpleDateFormat("-MM-DD")
val TestDateData = sc.parallelize(List(
  ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)),
  ("2", new
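The format string above appears to have lost its year component in the archive. Assuming the intent was a year-month-day pattern, note that in SimpleDateFormat "dd" is day-of-month while "DD" is day-of-year, so something like the following is probably what was meant:

import java.text.SimpleDateFormat

// yyyy = year, MM = month, dd = day of month (uppercase DD would be day of year)
val formate = new SimpleDateFormat("yyyy-MM-dd")
val login = new java.sql.Date(formate.parse("2016-01-31").getTime)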

Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I'm rather confused here as to what to do about creating a new SparkContext. Spark 2.0 prevents it... (exception included below) yet a TON of examples I've seen basically tell you to create a new SparkContext as standard practice:

Re: Spark Java Heap Error

2016-09-13 Thread Baktaawar
The data set is not big. It is 56K x 9K. It does have column names as long strings. It fits very easily in Pandas, and that is also an in-memory thing. So I am not sure if memory is an issue here. If Pandas can fit it very easily and work on it very fast, then Spark shouldn't have problems too, right? On

Re: Spark Java Heap Error

2016-09-13 Thread neil90
I'm assuming the dataset you're dealing with is big, hence why you wanted to allocate your full 16GB of RAM to it. I suggest running the Python spark shell as such: "pyspark --driver-memory 16g". Also, if you cache your data and it doesn't fully fit in memory you can do

What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1

Re: Master OOM in "master-rebuild-ui-thread" while running stream app

2016-09-13 Thread Mariano Semelman
Thanks, I would go with log disabling. BTW, the master crashed while the application was still running. -- Mariano Semelman P13N - IT Av. Corrientes Nº 746 - piso 13 - C.A.B.A. (C1043AAU) Teléfono (54) 11-4894-3500

Re: Spark_JDBC_Partitions

2016-09-13 Thread Rabin Banerjee
Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, which is known as OraOop. Spark cannot do everything; you need an Oozie workflow which will trigger a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it. Hope it helps!! On

Re: Spark Java Heap Error

2016-09-13 Thread Baktaawar
I put driver memory as 6GB instead of 8 (half of 16). But does 2GB make this difference? On Tuesday, September 13, 2016, neil90 [via Apache Spark User List] < ml-node+s1001560n27704...@n3.nabble.com> wrote: > Double check your Driver Memory in your Spark Web UI make sure the driver > Memory is

Re: Spark Java Heap Error

2016-09-13 Thread neil90
Double check your driver memory in your Spark Web UI; make sure the driver memory is close to half of the 16GB available. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Java-Heap-Error-tp27669p27704.html Sent from the Apache Spark User List mailing list

Re: Master OOM in "master-rebuild-ui-thread" while running stream app

2016-09-13 Thread Bryan Cutler
It looks like you have logging enabled and your application event log is too large for the master to build a web UI from it. In spark 1.6.2 and earlier, when an application completes, the master rebuilds a web UI to view events after the fact. This functionality was removed in spark 2.0 and the
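A minimal sketch of the log-disabling workaround for a long-running app, assuming losing the post-mortem UI is acceptable (the app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("long-running-stream")
  .set("spark.eventLog.enabled", "false")   // no event log for the master to replay

val ssc = new StreamingContext(conf, Seconds(10))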

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Hi Praseetha,

:32: error: not found: value formate
Error occurred in an application involving default arguments.
      ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)),

What is that formate? Thanks Dr Mich Talebzadeh LinkedIn *

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi Mich, Thanks a lot for your reply. Here is the sample:

case class TestDate (id: String, loginTime: java.sql.Date)
val formate = new SimpleDateFormat("-MM-DD")
val TestDateData = sc.parallelize(List(
  ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)),

Spark SQL - Applying transformation on a struct inside an array

2016-09-13 Thread Olivier Girardot
Hi everyone, I'm currently trying to create a generic transformation mechanism on a DataFrame to modify an arbitrary column regardless of the underlying schema. It's "relatively" straightforward for complex types like struct<...> to apply an arbitrary UDF on the column and replace the data

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream etc.) decides?

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
The DStream implementation decides how to produce an RDD for a time (this is the compute method) The RDD implementation decides how to partition things (this is the getPartitions method) You can look at those methods in DirectKafkaInputDStream and KafkaRDD respectively if you want to see an

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently. Don't think of that as multiple micro-batches within a time slot. It's one RDD within a time slot, with multiple partitions. On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie

Master OOM in "master-rebuild-ui-thread" while running stream app

2016-09-13 Thread Mariano Semelman
Hello everybody, I am running a spark streaming app and I am planning to use it as a long running service. However, while trying the app in an rc environment I got this exception in the master daemon after 1 hour of running: Exception in thread "master-rebuild-ui-thread"

Re: Strings not converted when calling Scala code from a PySpark app

2016-09-13 Thread Alexis Seigneurin
Makes sense. Thanks Holden. Alexis On Mon, Sep 12, 2016 at 5:28 PM, Holden Karau wrote: > Ah yes so the Py4J conversions only apply on the driver program - your > DStream however is RDDs of pickled objects. If you want to with a transform > function use Spark SQL

Re: Why there is no top method in dataset api

2016-09-13 Thread Jakub Dubovsky
Thanks Sean, the important part of your answer for me is that orderBy + limit is doing only "partial sort" because of optimizer. That's what I was missing. I will give it a try... J.D. On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen wrote: > ​No, ​ > I'm not advising you to use

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Thanks, but that thread does not answer my questions, which are about the distributed nature of RDDs vs the small nature of "micro batches" and on how Spark Streaming distributes work. On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh wrote: > Hi Daan, > > You may find

Fetching Hive table data from external cluster

2016-09-13 Thread Satish Chandra J
Hi All, Currently using Spark 14.2. Please provide inputs if anyone has encountered the below-mentioned scenario: fetching Hive table data from an external Hadoop cluster into a DataFrame via a Spark job. I am interested in having the data directly in a DataFrame and applying transformations on top of

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Mich Talebzadeh
Hi Daan, You may find this link Re: Is "spark streaming" streaming or mini-batch? helpful. This was a thread in this forum not long ago. HTH Dr Mich Talebzadeh LinkedIn *

Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread DandyDev
Hi all! When reading about Spark Streaming and its execution model, I see diagrams like this a lot: It does a fine job explaining how

Re: LDA spark ML visualization

2016-09-13 Thread janardhan shetty
Any help is appreciated to proceed in this problem. On Sep 12, 2016 11:45 AM, "janardhan shetty" wrote: > Hi, > > I am trying to visualize the LDA model developed in spark scala (2.0 ML) > in LDAvis. > > Is there any links to convert the spark model parameters to the

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Can you send the RDDs that just create those two dates? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi All, I have a case class in Scala:

case class TestDate (id: String, loginTime: java.sql.Date)

I created two RDDs of type TestDate. I wanted to do an inner join on the two RDDs where the values of the loginTime column are equal. Please find the code snippet below,

Re: Spark_JDBC_Partitions

2016-09-13 Thread Igor Racic
Hi, One way can be to use the NTILE function to partition data. Example:

REM Creating test table
create table Test_part as select * from ( select rownum rn from all_tables t1 ) where rn <= 1000;

REM Partition lines by Oracle block number, 11 partitions in this example.
select ntile(11) over( order

Spark SQL - Actions and Transformations

2016-09-13 Thread brccosta
Dear all, We're performing some tests with cache and persist in datasets. In RDD, we know that the transformations are lazy, being executed only when an action occurs. So, for example, we put a .cache() in a RDD after an action, which in turn is executed as the last operations of a sequence of

Any viable DATEDIFF function in Spark/Scala

2016-09-13 Thread Mich Talebzadeh
Hi, this is the tricky bit. I use the following to get the current date and time:

scala> val date = java.time.LocalDate.now.toString
date: String = 2016-09-13
scala> val hour = java.time.LocalTime.now.toString
hour: String = 11:49:13.577

I store a column called TIMECREATED as String in hdfs. For now
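If TIMECREATED can be parsed as a date, one hedged option is Spark SQL's datediff (the DataFrame and the derived column name below are illustrative):

import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// datediff returns the number of whole days between two date columns
val withAge = df.withColumn(
  "days_since_created",
  datediff(current_date(), to_date(col("TIMECREATED"))))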

Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Natu Lauchande
Hi, I think the Spark UI will be accessible whenever you launch a Spark app in the cluster; it should be the Application Tracker link. Regards, Natu On Tue, Sep 13, 2016 at 9:37 AM, Divya Gehlot wrote: > Hi , > Thank you all.. > Hurray ...I am able to view the hadoop

Re: Spark with S3 DirectOutputCommitter

2016-09-13 Thread Steve Loughran
On 12 Sep 2016, at 19:58, Srikanth > wrote: Thanks Steve! We are already using HDFS as an intermediate store. This is for the last stage of processing which has to put data in S3. The output is partitioned by 3 fields, like

Re: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-13 Thread Steve Loughran
On 12 Sep 2016, at 13:04, Daniel Lopes > wrote: Thanks Steve, But this error occurs only with parquet files, CSVs work. Out of my depth then, I'm afraid, sorry. Best, Daniel Lopes Chief Data and Analytics Officer | OneMatch c: +55 (18)

Re: Zeppelin patterns with the streaming data

2016-09-13 Thread Mich Talebzadeh
Hi Chanh, Yes indeed. Apparently it is implemented through a class of its own. I have specified a refresh of every 15 seconds. Obviously if there is an issue then the cron will not be able to refresh but you cannot sort out that problem from the web page anyway Thanks Dr Mich Talebzadeh

Re: Zeppelin patterns with the streaming data

2016-09-13 Thread Chanh Le
Hi Mich, I think it can http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/crontrigger > On Sep 13, 2016, at 1:57 PM, Mich Talebzadeh > wrote: > > Thanks

Re: Ways to check Spark submit running

2016-09-13 Thread Deepak Sharma
Use yarn-client mode and you can see the logs in the console after you submit. On Tue, Sep 13, 2016 at 11:47 AM, Divya Gehlot wrote: > Hi, > > Some how for time being I am unable to view Spark Web UI and Hadoop Web > UI. > Looking for other ways ,I can check my job is

Re: Partition n keys into exacly n partitions

2016-09-13 Thread Christophe Préaud
Hi, A custom partitioner is indeed the solution. Here is a sample code:

import org.apache.spark.Partitioner

class KeyPartitioner(keyList: Seq[Any]) extends Partitioner {
  def numPartitions: Int = keyList.size + 1
  def getPartition(key: Any): Int = keyList.indexOf(key) + 1
  override def
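For completeness, usage would look something like the following (the key list and pair RDD are made-up examples):

val keys = Seq("a", "b", "c")
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)))

// One partition per known key, plus partition 0 for any key not in the list
val partitioned = pairs.partitionBy(new KeyPartitioner(keys))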

Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Divya Gehlot
Hi, Thank you all.. Hurray... I am able to view the Hadoop web UI now @ 8088, and even the Spark History Server web UI @ 18080, but I am unable to figure out the Spark UI web port... Tried 4044, 4040... getting the below error: "This site can’t be reached". How can I find out the Spark port? Would really

Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Jonathan Kelly
I would not recommend opening port 50070 on your cluster, as that would give the entire world access to your data on HDFS. Instead, you should follow the instructions found here to create a secure tunnel to the cluster, through which you can proxy requests to the UIs using a browser plugin like

Ways to check Spark submit running

2016-09-13 Thread Divya Gehlot
Hi, Somehow, for the time being, I am unable to view the Spark Web UI and the Hadoop Web UI. I am looking for other ways I can check that my job is running fine, apart from checking the current YARN logs. Thanks, Divya