Re: Using saveAsNewAPIHadoopDataset for Saving custom classes to Hbase

2016-04-22 Thread Ted Yu
Which HBase release are you using? Below is the write method from HBase 1.1:

public void write(KEY key, Mutation value) throws IOException {
  if (!(value instanceof Put) && !(value instanceof Delete)) {
    throw new IOException("Pass a Delete or a Put");
  }

Using saveAsNewAPIHadoopDataset for Saving custom classes to Hbase

2016-04-22 Thread Nkechi Achara
Hi All, I am having a few issues saving my data to HBase. I have created a pairRDD for my custom class using the following: val rdd1 = rdd.map{ it => (getRowKey(it), it) } val job = Job.getInstance(hConf) val jobConf = job.getConfiguration
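A hedged sketch of how this pipeline is usually wired up with TableOutputFormat (the table name, column family and qualifier are illustrative; rdd, hConf and getRowKey are taken from the post above, and Put.addColumn assumes the HBase 1.x client):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance(hConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "my_table")

// The RDD handed to saveAsNewAPIHadoopDataset must contain Mutations (Put/Delete),
// so the custom class is converted into a Put here rather than written directly.
val hbasePuts = rdd.map { it =>
  val put = new Put(Bytes.toBytes(getRowKey(it)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes(it.toString))
  (new ImmutableBytesWritable, put)
}
hbasePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)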

Re: executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
Thank you. For now we plan to use spark-shell to submit jobs. Regards, Raghava. On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote: > Glad to hear that the problem was solvable! I have not seen delays of this > type for later stages in jobs run by spark-submit, but I do not

RE: Java exception when showing join

2016-04-22 Thread Yong Zhang
use "dispute_df.join(comments_df, dispute_df.COMMENTID === comments_df.COMMENTID).first()" instead. Yong Date: Fri, 22 Apr 2016 17:42:26 -0400 From: webe...@aim.com To: user@spark.apache.org Subject: Java exception when showing join I am using pyspark with netezza. I am getting a java

Re: executor delay in Spark

2016-04-22 Thread Mike Hynes
Glad to hear that the problem was solvable! I have not seen delays of this type for later stages in jobs run by spark-submit, but I do not think it impossible if your stage has no lineage dependence on other RDDs. I'm CC'ing the dev list to report of other users observing load imbalance caused by

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
So is there any way of creating an RDD without using offsetRanges? Sorry for the lack of clarity here. val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges) Dr Mich Talebzadeh LinkedIn *

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
So there is really no point in using it :( Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 23 April

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
The class is private : final class OffsetRange private( On Fri, Apr 22, 2016 at 4:08 PM, Mich Talebzadeh wrote: > Ok I decided to forgo that approach and use an existing program of mine > with slight modification. The code is this > > import
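Since the constructor is private, the companion object is the intended entry point. A minimal sketch assuming the spark-streaming-kafka 1.x API (topic name, partition and offsets are illustrative; sc and kafkaParams come from the earlier message):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// topic, partition, fromOffset, untilOffset
val offsetRanges = Array(OffsetRange.create("newtopic", 0, 0L, 100L))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)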

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
Ok I decided to forgo that approach and use an existing program of mine with slight modification. The code is this import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spark.sql.hive.HiveContext import

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Mich Talebzadeh
Hi Cody, This is my first attempt at using offset ranges (this may not mean much in my context at the moment): val ssc = new StreamingContext(conf, Seconds(10)) ssc.checkpoint("checkpoint") val kafkaParams = Map[String, String]("bootstrap.servers" -> "rhes564:9092", "schema.registry.url" ->
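For reference, a hedged sketch of the usual pattern for getting at offset ranges with the direct stream (the topic name is illustrative; ssc and kafkaParams are from the post above):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("newtopic"))

dstream.foreachRDD { rdd =>
  // Offsets are exposed on the RDDs of a direct stream via HasOffsetRanges.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
}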

Java exception when showing join

2016-04-22 Thread webe3vt
I am using pyspark with netezza. I am getting a java exception when trying to show the first row of a join. I can show the first row of each of the two dataframes separately, but not the result of a join. I get the same error for any action I take (first, collect, show). Am I doing something

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Spark Streaming as it exists today is always micro-batch. You can certainly filter messages using Spark Streaming. On Fri, Apr 22, 2016 at 4:02 PM, Mich Talebzadeh wrote: > yep actually using createDirectStream sounds a better way of doing it. Am I > correct that

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Mich Talebzadeh
Yep, actually using createDirectStream sounds like a better way of doing it. Am I correct that createDirectStream was introduced to overcome micro-batching limitations? In a nutshell, I want to pick up all the messages and keep a signal according to pre-built criteria (say indicating a *buy signal*) and
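A minimal sketch of that kind of filtering on a direct stream, assuming each Kafka message value is a CSV line with a price field (the field index and threshold are hypothetical):

val signals = dstream
  .map { case (_, value) => value.split(',') }   // (key, value) pairs from createDirectStream
  .filter(fields => fields(2).toDouble > 95.0)   // hypothetical "buy" criterion on a price field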

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
You can still do sliding windows with createDirectStream, just do your map / extraction of fields before the window. On Fri, Apr 22, 2016 at 3:53 PM, Mich Talebzadeh wrote: > Hi Cody, > > I want to use sliding windows for Complex Event Processing micro-batching > > Dr
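A minimal sketch of that ordering (dstream is the direct stream from this thread; the extraction and the 60s/10s window and slide durations are illustrative):

import org.apache.spark.streaming.Seconds

val values   = dstream.map { case (_, value) => value }  // extract fields first
val windowed = values.window(Seconds(60), Seconds(10))   // then apply the sliding window
windowed.foreachRDD(rdd => rdd.take(10).foreach(println))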

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Mich Talebzadeh
Hi Cody, I want to use sliding windows for Complex Event Processing micro-batching Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Why do you want to convert? As far as doing the conversion, createStream doesn't take the same arguments; look at the docs. On Fri, Apr 22, 2016 at 3:44 PM, Mich Talebzadeh wrote: > Hi, > > What is the best way of converting this program of that uses >
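For comparison, a hedged sketch of the receiver-based call: it takes a ZooKeeper quorum, a consumer group and a topic-to-thread-count map rather than broker metadata (host, group id and topic are illustrative):

val topicMap = Map("newtopic" -> 1)
val receiverStream = KafkaUtils.createStream(ssc, "localhost:2181", "cep-group", topicMap)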

Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Mich Talebzadeh
Hi, What is the best way of converting this program that uses KafkaUtils.createDirectStream to a sliding window, going from val dstream = *KafkaUtils.createDirectStream*[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic) to val dstream = *KafkaUtils.createStream*[String, String,

Executor still on the UI even if the worker is dead

2016-04-22 Thread kundan kumar
Hi Guys, Has anyone faced this issue with Spark? Why does it happen in Spark Streaming that the executors are still shown on the UI even when the worker has been killed and is no longer in the cluster? This severely impacts my running jobs, which take much longer, with stages failing with the exception

executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
Mike, It turns out the executor delay, as you mentioned, is the cause. After we introduced a dummy stage, partitioning was working fine. Does this delay happen during later stages as well? We noticed the same behavior (partitioning happens on spark-shell but not through spark-submit) at a later
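A minimal sketch of the dummy-stage workaround being described (sizes, partition count and input path are illustrative): run a cheap job first so every executor has registered with the driver before the real data is partitioned.

sc.parallelize(1 to 100000, 64).count()           // dummy stage, result discarded
val data = sc.textFile(inputPath).repartition(64) // real work starts once executors are up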

Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Bijay Kumar Pathak
Hi Zhan, I tried with the IF NOT EXISTS clause and still I cannot see the first partition; only the partition from the last insert overwrite is present in the table. Thanks, Bijay On Thu, Apr 21, 2016 at 11:18 PM, Zhan Zhang wrote: > INSERT OVERWRITE will overwrite any existing

Re: How this unit test passed on master trunk?

2016-04-22 Thread Ted Yu
This was added by Xiao through: [SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star I tried in spark-shell and got: scala> val first = structDf.groupBy($"a").agg(min(struct($"record.*"))).first() first:

MLLib PySpark RandomForest too many features per tree

2016-04-22 Thread flewloon
When I choose my featureSubsetStrategy for the RandomForestModel I set it to sqrt, which looks like it should let each decision tree pick the sqrt of the total number of features. I have 900 features, so I thought each tree would have ~30 features or fewer. When I
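For reference, a hedged Scala sketch of the MLlib call in question (dataset and hyper-parameters are illustrative); per the MLlib docs, featureSubsetStrategy sets the number of features considered for splits at each node, rather than fixing a feature subset for a whole tree:

import org.apache.spark.mllib.tree.RandomForest

// trainingData is an RDD[LabeledPoint] assumed to be prepared elsewhere
val model = RandomForest.trainClassifier(
  trainingData,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "sqrt",   // roughly sqrt(900) = 30 features per split
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)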

How this unit test passed on master trunk?

2016-04-22 Thread Yong Zhang
Hi, I was trying to find out why this unit test can pass in the Spark code in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala for this unit test: test("Star Expansion - CreateStruct and CreateArray") { val structDf =

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Marcelo Vanzin
On Fri, Apr 22, 2016 at 10:38 AM, Mich Talebzadeh wrote: > I am trying to test Spark with CEP and I have been shown a sample here >

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
Thanks Ted for the info Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 22 April 2016 at 18:38, Ted

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
Hi Marcelo Thanks for your input. I am trying to test Spark with CEP and I have been shown a sample here https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532 It is a long code package

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Marcelo: From yesterday's thread, Mich revealed that he was looking at: https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala which references SparkFunSuite. In an earlier thread, Mich was asking about CEP. Just

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Marcelo Vanzin
Sorry, I've been looking at this thread and the related ones and one thing I still don't understand is: why are you trying to use internal Spark classes like Logging and SparkFunSuite in your code? Unless you're writing code that lives inside Spark, you really shouldn't be trying to reference
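A hedged alternative sketch for application code: a small logging trait over SLF4J (which Spark itself uses under the hood), so nothing from org.apache.spark.internal is referenced; the names are illustrative.

import org.slf4j.{Logger, LoggerFactory}

trait AppLogging {
  // lazy + transient so the logger is not dragged into serialized closures
  @transient lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
}

object CepJob extends AppLogging {
  def main(args: Array[String]): Unit = log.info("starting CEP job")
}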

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
java.lang.IllegalArgumentException: Cannot add dependency 'org.apache.spark#spark-core_2.10;1.5.1' to configuration 'tests' of module cep_assembly#cep_assembly_2.10;1.0 because this configuration doesn't exist! Dr Mich Talebzadeh LinkedIn *
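A hedged guess at the sbt line for pulling in Spark's test-jar, which avoids the non-existent "tests" configuration in the error above (version and scope as used in this thread):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "test" classifier "tests"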

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
For SparkFunSuite, add the following: libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.5.1" % "tests" On Fri, Apr 22, 2016 at 7:20 AM, Mich Talebzadeh wrote: > Trying to build with sbt with the following dependencies > > libraryDependencies +=

modifying a DataFrame column that is a MapType

2016-04-22 Thread williamd1618
Hi, I have a use case where I've loaded data into DataFrame and would like to split the DataFrame into two on some predicate, modify one split DataFrame using withColumn so as to prevent the need to reprocess the data, union the data, and write out to the filesystem. An example of the schema
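A minimal sketch of that flow (the predicate, added column and output path are hypothetical): split once on a predicate, adjust only one half with withColumn, then union and write, so the untouched half is not reprocessed.

import org.apache.spark.sql.functions.{col, lit}

val needsFix = df.filter(col("status") === "NEW").withColumn("reprocessed", lit(true))
val rest     = df.filter(!(col("status") === "NEW")).withColumn("reprocessed", lit(false))
needsFix.unionAll(rest).write.parquet("/tmp/out")   // unionAll on Spark 1.6 DataFrames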

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-22 Thread Alexander Gallego
Thanks Brian. This is basically what I have as well; I posted pretty much the same gist in the first email: .foreachRDD(rdd => { rdd.foreachPartition(part => { val producer: Producer[String, String] = KafkaWriter.createProducer( brokers) part.foreach(item =>
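A hedged completion of that gist (KafkaWriter.createProducer and brokers are from the post; the output topic and the use of the new producer API are assumptions): the producer is built inside foreachPartition, so it is never captured by a serialized closure, and it is closed once the partition is done.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    val producer: KafkaProducer[String, String] = KafkaWriter.createProducer(brokers)
    part.foreach { item =>
      producer.send(new ProducerRecord[String, String]("output-topic", item.toString))
    }
    producer.close()   // one producer per partition, not per record
  }
}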

Re: Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread andrew.rowson
Apologies, Outlook for Mac is ridiculous. Copy and paste of the original below: - I’m running into a strange issue with trying to use a custom Log4j layout for Spark (1.6.1) on YARN (CDH). The layout is: https://github.com/michaeltandy/log4j-json If I use a log4j.properties file (supplied

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
Trying to build with sbt with the following dependencies libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1" % "provided" libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" %

Re: Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread Ted Yu
There is not much in the body of the email. Can you elaborate on what issue you encountered? Thanks On Fri, Apr 22, 2016 at 2:27 AM, Rowson, Andrew G. (TR Technology & Ops) < andrew.row...@thomsonreuters.com> wrote: > > > > This e-mail is for the sole use of the

Best practices repartition key

2016-04-22 Thread nihed mbarek
Hi, I'm looking for documentation or best practices on choosing a key or keys for repartitioning a DataFrame or RDD. Thank you MBAREK nihed -- M'BAREK Med Nihed, Fedora Ambassador, TUNISIA, Northern Africa http://www.nihed.com
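A minimal sketch of the usual rule of thumb (column name, key type and partition count are illustrative): repartition on the column or key that downstream joins and aggregations use, so related rows land on the same partition.

import org.apache.spark.HashPartitioner

val dfByKey  = df.repartition(df("customer_id"))             // DataFrame API (Spark 1.6+)
val rddByKey = pairRdd.partitionBy(new HashPartitioner(200)) // (K, V) pair RDD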

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Normally Logging would be included in a spark-shell session since the spark-core jar is imported by default: scala> import org.apache.spark.internal.Logging import org.apache.spark.internal.Logging See this JIRA: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging In

Run Apache Spark on EMR

2016-04-22 Thread Jinan Alhajjaj
Hi All, I would like to ask two things, and I would really appreciate an answer ASAP. 1. How do I implement parallelism in an Apache Spark Java application? 2. How do I run the Spark application in Amazon EMR?

Re: question about Reynold's talk: " The Future of Real Time"

2016-04-22 Thread Petr Novak
Hi, I understand it simply as meaning that they will provide some lower-latency interface, probably using JDBC, so that 3rd-party BI tools can integrate and query streams as if they were static datasets. If the BI tool repeats the query, the result will be updated. I don't know if BI tools are already heading

question about Reynold's talk: " The Future of Real Time"

2016-04-22 Thread charles li
Hi there, the talk *The Future of Real Time in Spark* here https://www.youtube.com/watch?v=oXkxXDG0gNk mentions "BI app integration" at 24:28 of the video. What does he mean by *BI app integration* in that talk? Does that mean that they will develop a BI tool like Zeppelin,

Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread Rowson, Andrew G. (TR Technology & Ops)
This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain

Re: pyspark EOFError after calling map

2016-04-22 Thread Pete Werner
Oh great, thank you for clearing that up. On Fri, Apr 22, 2016 at 5:15 PM, Davies Liu wrote: > This exception is already handled well, just noisy, should be muted. > > On Wed, Apr 13, 2016 at 4:52 PM, Pete Werner > wrote: > >> Hi >> >> I am new to

Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Mich Talebzadeh
Hi, Anyone know which jar file has import org.apache.spark.internal.Logging? I tried spark-core_2.10-1.5.1.jar but it does not seem to work: scala> import org.apache.spark.internal.Logging :57: error: object internal is not a member of package org.apache.spark import

Re: pyspark EOFError after calling map

2016-04-22 Thread Davies Liu
This exception is already handled well, just noisy, should be muted. On Wed, Apr 13, 2016 at 4:52 PM, Pete Werner wrote: > Hi > > I am new to spark & pyspark. > > I am reading a small csv file (~40k rows) into a dataframe. > > from pyspark.sql import functions as F > df

Re: Save DataFrame to HBase

2016-04-22 Thread Zhan Zhang
You can try this https://github.com/hortonworks/shc.git or here http://spark-packages.org/package/zhzhan/shc Currently it is in the process of being merged into HBase. Thanks. Zhan Zhang On Apr 21, 2016, at 8:44 AM, Benjamin Kim wrote: Hi Ted,

Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Zhan Zhang
INSERT OVERWRITE will overwrite any existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0). Thanks. Zhan Zhang On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak
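For reference, a hedged sketch of that Hive syntax issued from a HiveContext (table, partition and source names are illustrative); IF NOT EXISTS goes after the PARTITION clause:

sqlContext.sql(
  """INSERT OVERWRITE TABLE target PARTITION (ds = '2016-04-22') IF NOT EXISTS
    |SELECT id, value FROM staging""".stripMargin)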

Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Divya Gehlot
An easy way of doing it: newdf = df.withColumn('total', sum(df[col] for col in df.columns)) On 22 April 2016 at 11:51, Naveen Kumar Pokala wrote: > Hi, > > > > Do we have any way to perform Row level operations in spark dataframes. > > > > > > For example, > > > > I have
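A hedged Scala equivalent of that one-liner, assuming all columns are numeric: fold the column list into a single sum expression and attach it as a new column.

import org.apache.spark.sql.functions.col

val total = df.columns.map(col).reduce(_ + _)   // col1 + col2 + ... over every column
val newdf = df.withColumn("total", total)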

Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Zhan Zhang
You can define your own UDF; the following is one example. Thanks, Zhan Zhang

val foo = udf((a: Int, b: String) => a.toString + b)
checkAnswer( // SELECT *, foo(key, value) FROM testData
  testData.select($"*", foo('key, 'value)).limit(3),

On Apr 21, 2016, at 8:51 PM, Naveen Kumar Pokala