Re: using spark to load a data warehouse in real time

2017-02-28 Thread Jörn Franke
I am not sure that Spark Streaming is what you want. It is for streaming analytics, not for loading into a DWH. You also need to define what real time means and what is needed there - it will differ significantly from client to client. From my experience, just SQL is not enough for the users

Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
I've verified this is that issue, so please disregard. On Wed, Mar 1, 2017 at 1:07 AM, Justin Pihony wrote: > As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at > it deeper now. > > On Wed,

Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at it deeper now. On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony wrote: > Run spark-shell --packages >

Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
Run spark-shell --packages datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to do an import of anything under com.datastax. I have checked that the jar is listed on the classpath and it is, albeit behind a spark URL. I'm wondering if added jars fail in Windows due to this server

Re: Can't transform RDD for the second time

2017-02-28 Thread Mark Hamstra
foreachPartition is not a transformation; it is an action. If you want to transform an RDD using an iterator in each partition, then use mapPartitions. On Tue, Feb 28, 2017 at 8:17 PM, jeremycod wrote: > Hi, > > I'm trying to transform one RDD two times. I'm using
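As a rough illustration of the difference (assuming a spark-shell session where the SparkContext sc is available; the data here is made up):

    val rdd = sc.parallelize(1 to 100, 4)

    // mapPartitions is a transformation: it returns a new RDD built from
    // the iterator over each partition's elements.
    val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))

    // foreachPartition is an action: it returns Unit and is only useful
    // for side effects, such as writing each partition out somewhere.
    doubled.foreachPartition(iter => println(iter.size))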

Re: RDD blocks on Spark Driver

2017-02-28 Thread Prithish
This is the command I am running: spark-submit --deploy-mode cluster --master yarn --class com.myorg.myApp s3://my-bucket/myapp-0.1.jar On Wed, Mar 1, 2017 at 12:22 AM, Jonathan Kelly wrote: > Prithish, > > It would be helpful for you to share the spark-submit command

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Prithish
Thanks for your response Jonathan. Yes, this works. I also added another way of achieving this to the Stackoverflow post. Thanks for the help. On Tue, Feb 28, 2017 at 11:58 PM, Jonathan Kelly wrote: > Prithish, > > I saw you posted this on SO, so I responded there just

Re: Spark - Not contains on Spark dataframe

2017-02-28 Thread KhajaAsmath Mohammed
Hi, My dataframe has records matching the below conditions, but the dataframe never gets filtered. I am always getting the total count of the original records even after using the below filter function. Am I doing anything wrong here? Note: I tried OR and || too. def filterDatapointRawCountsDataFrame(datapoint_df:

Can't transform RDD for the second time

2017-02-28 Thread jeremycod
Hi, I'm trying to transform one RDD two times. I'm using foreachPartition, and inside it I have two map transformations. The first time it works fine and I get results, but the second time I call map on it, it behaves as if the RDD has no elements. This is my code: val credentialsIdsScala:

Re: Why Spark cannot get the derived field of case class in Dataset?

2017-02-28 Thread Michael Armbrust
We only serialize things that are in the constructor. You would have access to it in the typed API (df.map(_.day)). I'd suggest making a factory method that fills these in and puts them in the constructor if you need to get to them from other dataframe operations. On Tue, Feb 28, 2017 at 12:03 PM,
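A small sketch of that suggestion (the class and field names are invented for illustration; it assumes spark-shell with spark.implicits._ in scope):

    // Keep the derived "day" field in the constructor so the encoder serializes it,
    // and compute it in a companion factory method.
    case class Event(ts: java.sql.Timestamp, day: String)

    object Event {
      def apply(ts: java.sql.Timestamp): Event =
        Event(ts, new java.text.SimpleDateFormat("yyyy-MM-dd").format(ts))
    }

    val ds = Seq(Event(java.sql.Timestamp.valueOf("2017-02-28 12:00:00"))).toDS()
    ds.select("day").show()   // visible to untyped dataframe operations
    ds.map(_.day)             // and still available through the typed API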

Re: How to use ManualClock with Spark streaming

2017-02-28 Thread Saisai Shao
I don't think using ManualClock is the right way to fix your problem here in Spark Streaming. ManualClock in Spark is mainly used for unit tests, where you manually advance the time to make the unit test work. That usage is different from the scenario you mentioned. Thanks Jerry On Tue,

Re: error in kafka producer

2017-02-28 Thread shyla deshpande
*My code:* producer.send(message, new Callback { override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = { log.info(s"producer send callback metadata: $metadata") log.info(s"producer send callback exception: $exception") }

Re: error in kafka producer

2017-02-28 Thread Marco Mistroni
Is this exception coming from a Spark program? Could you share a few lines of code? kr marco On Tue, Feb 28, 2017 at 10:23 PM, shyla deshpande wrote: > producer send callback exception: > org.apache.kafka.common.errors.TimeoutException: > Expiring 1 record(s) for

Re: [Spark Kafka] API Doc pages for Kafka 0.10 not current

2017-02-28 Thread Cody Koeninger
The kafka-0-8 and kafka-0-10 integrations have conflicting dependencies. Last time I checked, Spark's doc publication puts everything all in one classpath, so publishing them both together won't work. I thought there was already a Jira ticket related to this, but a quick look didn't turn it up.

error in kafka producer

2017-02-28 Thread shyla deshpande
producer send callback exception: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for positionevent-6 due to 30003 ms has passed since batch creation plus linger time

RE: using spark to load a data warehouse in real time

2017-02-28 Thread Adaryl Wakefield
I’m actually trying to come up with a generalized use case that I can take from client to client. We have structured data coming from some application. Instead of dropping it into Hadoop and then using yet another technology to query that data, I just want to dump it into a relational MPP DW so

Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
You're either running a really old version of Spark where there might have been issues in that code, or you're actually missing some backslashes in the command you pasted in your message. On Tue, Feb 28, 2017 at 2:05 PM, Joe Olson wrote: >> Everything after the jar path is

Re: spark-submit question

2017-02-28 Thread Joe Olson
> Everything after the jar path is passed to the main class as parameters. I don't think that is accurate if your application arguments contain double dashes. I've tried several permutations with and without '\'s and newlines. Just thought I'd ask here before I have to re-configure and

How to configure global_temp database via Spark Conf

2017-02-28 Thread SRK
Hi, How do I configure the global_temp database via SparkConf? I know that it's a system-preserved database. Can it be configured via Spark Conf? Thanks, Swetha -- View this message in context:

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Henry Tremblay
We did this all the time at my last position. 1. We had unstructured data in S3. 2. We read directly from S3 and then gave structure to the data with a dataframe in Spark. 3. We wrote the results to S3. 4. We used Redshift's super-fast parallel ability to load the results into a table. Henry
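A rough sketch of that pipeline, with made-up bucket paths and a made-up tab-separated layout (it assumes a SparkSession spark with S3 credentials already configured):

    import org.apache.spark.sql.functions._

    // 1-2. Read the raw lines from S3 and give them structure as a dataframe.
    val raw = spark.read.text("s3a://my-bucket/raw/")
    val structured = raw
      .select(split(col("value"), "\t").as("fields"))
      .select(col("fields")(0).as("id"), col("fields")(1).as("payload"))

    // 3. Write the structured results back to S3 ...
    structured.write.mode("overwrite").csv("s3a://my-bucket/structured/")

    // 4. ... so that Redshift can load them in parallel (e.g. with its COPY command).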

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread John Desuvio
Since the data is in multiple JVMs, only 1 of them can be the driver. So I can parallelize the data from 1 of the VMs but don't have a way to do the same for the others. Or am I missing something? On Tue, Feb 28, 2017 at 3:53 PM, ayan guha wrote: > How about parallelize

Re: Spark runs out of memory with small file

2017-02-28 Thread Henry Tremblay
Cool! Now I understand how to approach this problem. At my last position, I don't think we did it quite efficiently. Maybe a blog post by me? Henry On 02/28/2017 01:22 AM, 颜发才(Yan Facai) wrote: Google is your friend, Henry.

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread ayan guha
How about parallelizing and then unioning all of them into one data frame? On Wed, 1 Mar 2017 at 3:07 am, Sean Owen wrote: > Broadcasts let you send one copy of read-only data to each executor. > That's not the same as a DataFrame, and its nature means it doesn't make > sense to
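A minimal sketch of that idea, assuming the driver can first pull each JVM's subset into a local collection (the tuples below are just stand-ins), in a spark-shell session with spark.implicits._ in scope:

    // Stand-ins for the subsets pulled from each JVM into the driver.
    val fromJvm1 = Seq((1, "a"), (2, "b"))
    val fromJvm2 = Seq((3, "c"), (4, "d"))

    // Parallelize each subset into a DataFrame, then union them into one.
    val df = Seq(fromJvm1, fromJvm2)
      .map(_.toDF("id", "value"))
      .reduce(_ union _)

    df.show()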

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-28 Thread Shixiong(Ryan) Zhu
The REST APIs are not just for Spark history server. When an application is running, you can use the REST APIs to talk to Spark UI HTTP server as well. On Tue, Feb 28, 2017 at 10:46 AM, Parag Chaudhari wrote: > ping... > > > > *Thanks,Parag Chaudhari,**USC Alumnus (Fight

global_temp database not getting created in Spark 2.x

2017-02-28 Thread SRK
Hi, The global_temp database is not getting created when I try to use Spark 2.x. Do I need to create it manually or do I need any permissions to do the same? 17/02/28 12:08:09 INFO HiveMetaStore.audit: ugi=user12345 ip=unknown-ip-addr cmd=get_database: global_temp 17/02/28 12:08:09

Why Spark cannot get the derived field of case class in Dataset?

2017-02-28 Thread Yong Zhang
In the following example, the "day" value is in the case class, but I cannot get it in the Spark dataset, where I would like to use it at runtime. Any idea? Do I have to force it to be present in the case class constructor? I'd like to derive it automatically and use it in the dataset or

Does monotonically_increasing_id generates the same id even when executor fails or being evicted out of memory

2017-02-28 Thread Lan Jiang
Hi there, I am trying to generate a unique ID for each record in a dataframe, so that I can save the dataframe to a relational database table. My question is: when the dataframe is regenerated due to executor failure or being evicted out of cache, do the IDs stay the same as before?
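For reference, a minimal usage sketch of the function in question (the dataframe df and the column name are placeholders; this does not answer the stability question):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Attach a generated id column before writing the dataframe out.
    val withId = df.withColumn("record_id", monotonically_increasing_id())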

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
You could try this as a blueprint: Read the data in through Spark Streaming. Iterate over it and convert each RDD into a DataFrame. Use these DataFrames to perform whatever processing is required and then save that DataFrame into your target relational warehouse. HTH Tariq,
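A hedged sketch of that blueprint, assuming Kafka (mentioned elsewhere in this thread) as the source and the 0.10 direct stream API; the broker, topic, JDBC URL, table and credentials are placeholders, and spark-streaming-kafka-0-10 plus a JDBC driver are assumed to be on the classpath:

    import java.util.Properties
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val spark = SparkSession.builder.appName("dwh-loader").getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dwh-loader")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    val jdbcProps = new Properties()
    jdbcProps.setProperty("user", "dwh_user")
    jdbcProps.setProperty("password", "dwh_password")

    // Convert each micro-batch RDD to a DataFrame, transform as needed,
    // and append it to the target warehouse table over JDBC.
    stream.map(_.value).foreachRDD { rdd =>
      val df = rdd.toDF("payload")
      df.write.mode("append")
        .jdbc("jdbc:postgresql://dwh-host:5432/dwh", "staging_events", jdbcProps)
    }

    ssc.start()
    ssc.awaitTermination()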

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
Hi Adaryl, You could definitely load data into a warehouse through Spark's JDBC support for DataFrames. Could you please explain your use case a bit more? That'll help us answer your query better. Tariq, Mohammad - https://about.me/mti

Re: RDD blocks on Spark Driver

2017-02-28 Thread Jonathan Kelly
Prithish, It would be helpful for you to share the spark-submit command you are running. ~ Jonathan On Sun, Feb 26, 2017 at 8:29 AM Prithish wrote: > Thanks for the responses, I am running this on Amazon EMR which runs the > Yarn cluster manager. > > On Sat, Feb 25, 2017

Re: Is there any limit on number of tasks per stage attempt?

2017-02-28 Thread Parag Chaudhari
Thanks Jacek! Thanks, Parag On Fri, Feb 24, 2017 at 10:45 AM, Jacek Laskowski wrote: > Hi, > > Think it's the size of the type to count the partitions which I think is > Int. I don't think there's another reason. > > Jacek > > On 23 Feb 2017 5:01 a.m., "Parag Chaudhari"

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-28 Thread Parag Chaudhari
ping... Thanks, Parag Chaudhari, USC Alumnus (Fight On!) Mobile: (213)-572-7858 Profile: http://www.linkedin.com/pub/parag-chaudhari/28/a55/254 On Wed, Feb 22, 2017 at 7:54 PM, Parag Chaudhari wrote: >

RE: using spark to load a data warehouse in real time

2017-02-28 Thread Adaryl Wakefield
I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would, of course, have to be in any architecture, but it looks like they are suggesting that Kafka is all you need. My primary concern is the complexity of loading warehouses. I have a web development background, so I have

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Jonathan Kelly
Prithish, I saw you posted this on SO, so I responded there just now. See http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161 In short, an hdfs:// path can't be used to configure log4j because log4j knows nothing about hdfs. Instead, since you are

RE: Register Spark UDF for use with Hive Thriftserver/Beeline

2017-02-28 Thread Lavelle, Shawn
I forgot to mention, I can do: sparkSession.sql("create function z as 'QualityToString'"); prior to starting HiveThriftServer2 and that will register the UDF, but only in the default database. It won’t be present in other databases. I can register it again in the other databases as needed, but
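A small illustrative sketch of that workaround (the database list is assumed; QualityToString and sparkSession are from the message above):

    // Register the function once per database before starting the Thrift server,
    // since an unqualified CREATE FUNCTION only lands in the current/default database.
    val databases = Seq("default", "sales", "ops")   // assumed target databases
    databases.foreach { db =>
      sparkSession.sql(s"CREATE FUNCTION $db.z AS 'QualityToString'")
    }
    // ... then start HiveThriftServer2 as before.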

Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
Everything after the jar path is passed to the main class as parameters. So if it's not working you're probably doing something wrong in your code (that you haven't posted). On Tue, Feb 28, 2017 at 7:05 AM, Joe Olson wrote: > For spark-submit, I know I can submit application

Spark Streaming problem with Yarn

2017-02-28 Thread Amjad ALSHABANI
Hi everyone, I'm experiencing a problem with my Spark Streaming job when running it on YARN. The problem appears only when running this application in a YARN queue along with other Tez/MR applications. The problem is in the processing time, which exceeds 1 minute for batches of 1 second. Normally

Spark - Not contains on Spark dataframe

2017-02-28 Thread KhajaAsmath Mohammed
Hi, Could anyone please give me suggestions on how to resolve the issue I am facing with "not contains" code on a dataframe column? Here is the code. My dataframe is not getting filtered with the below conditions. I even tried not and ! on Column. Any suggestions? def
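For comparison, a minimal sketch of a working "not contains" filter on a made-up column (assuming spark-shell with spark.implicits._ in scope); whether it matches the original code would need the full function to tell:

    import org.apache.spark.sql.functions.{col, not}

    val df = Seq("abc", "abd", "xyz").toDF("datapoint")

    // Either negate the predicate with ! ...
    df.filter(!col("datapoint").contains("ab")).show()
    // ... or with the not() function; both keep only "xyz".
    df.filter(not(col("datapoint").contains("ab"))).show()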

Register Spark UDF for use with Hive Thriftserver/Beeline

2017-02-28 Thread Lavelle, Shawn
Hello all, I’m trying to make my custom UDFs available from a beeline session via Hive ThriftServer. I’ve been successful in registering them via my DataSourceAPI as it provides the current sqlContext. However, the udfs are not accessible at initial connection, meaning a query won’t parse

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread Sean Owen
Broadcasts let you send one copy of read-only data to each executor. That's not the same as a DataFrame, and its nature means it doesn't make sense to think of them as not distributed. But consider things like broadcast hash joins, which may be what you are looking for if you really mean to join

DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread johndesuv
Hi, I have an application that runs on a series of JVMs that each contain a subset of a large dataset in memory. I'd like to use this data in spark and am looking at ways to use this as a data source in spark without writing the data to disk as a handoff. Parallelize doesn't work for me since I

spark-submit question

2017-02-28 Thread Joe Olson
For spark-submit, I know I can submit application level command line parameters to my .jar. However, can I prefix them with switches? My command line params are processed in my applications using JCommander. I've tried several variations of the below with no success. An example of what I am

How to use ManualClock with Spark streaming

2017-02-28 Thread Hemalatha A
Hi, I am running a streaming application that reads data from Kafka and performs window operations on it. I have a use case where all incoming events have a fixed latency of 10s, which means data belonging to minute 10:00:00 will arrive 10s late, at 10:00:10. I want to set the Spark clock to

graph.vertices.collect().foreach(println)

2017-02-28 Thread balaji9058
Hi, does anyone have an idea about this? When I use graph.vertices.collect() in the console (spark console) I get only limited vertices data, as I have a million records. res34: Array[(org.apache.spark.graphx.VertexId, (Array[Double], Array[Double], Double, Double))] =

Re: RE: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Charles O. Bajomo
Unless this is a managed Hive table, I would expect you can just MSCK REPAIR the table to get the new partition. Of course, you will need to change the schema to reflect the new partition. Kind Regards From: "Triones,Deng(vip.com)" To: "Charles O. Bajomo"
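A hedged sketch of the two usual ways to register new partition directories with the metastore (the database and table names are placeholders; the dt/hm partition keys are those described in this thread). Depending on the Spark/Hive versions in use, these statements may need to be run from the Hive CLI instead of Spark SQL:

    // Discover all new partition directories in bulk ...
    spark.sql("MSCK REPAIR TABLE mydb.events")
    // ... or register a single known partition explicitly.
    spark.sql("ALTER TABLE mydb.events ADD IF NOT EXISTS PARTITION (dt='20170224', hm='1400')")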

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Jörn Franke
You do not need to place it in every local directory of every node. Just use hadoop fs -put to put it on HDFS. Alternatively, as others suggested, use S3. > On 28 Feb 2017, at 02:18, Yunjie Ji wrote: > > After start the dfs, yarn and spark, I run these code under the root >

RE: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Triones,Deng(vip.com)
I am writing data to an HDFS file, and the HDFS dir is a Hive partition dir. Hive does not support sub-dirs. For example, my partition folder is ***/dt=20170224/hm=1400, which means I need to write all the data between 1400 and 1500 to the same folder. From: Charles O. Bajomo

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Marco Mistroni
Or place the file in S3 and provide the S3 path. Kr On 28 Feb 2017 1:18 am, "Yunjie Ji" wrote: > After start the dfs, yarn and spark, I run these code under the root > directory of spark on my master host: > `MASTER=yarn ./bin/run-example ml.LogisticRegressionExample >

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Femi Anthony
Have you checked to see if there are any drivers to enable you to write to Greenplum directly from Spark ? You can also take a look at this link: https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q Apparently GPDB is based on Postgres so maybe that approach may

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Femi Anthony
Have you tried specifying an absolute instead of a relative path ? Femi > On Feb 27, 2017, at 8:18 PM, Yunjie Ji wrote: > > After start the dfs, yarn and spark, I run these code under the root > directory of spark on my master host: > `MASTER=yarn ./bin/run-example

Re: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Charles O. Bajomo
I see this problem as well with the _temporary directory but from what I have been able to gather, there is no way around it in that situation apart from making sure all reducers write to different folders. In the past I partitioned by executor id. I don't know if this is the best way though.

spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Triones,Deng(vip.com)
Hi dev and users, I am running Spark Streaming (Spark version 2.0.2) to write files to HDFS. When my spark.streaming.concurrentJobs is more than one, like 20, I meet the exception below. We know that when the batch finishes, there will be a _SUCCESS file. As I guess