[no subject]

2016-06-24 Thread Rama Perubotla
Unsubscribe

Re: Running JavaBased Implementation of StreamingKmeans Spark

2016-06-24 Thread Jayant Shekhar
Hi Biplob, can you try adding new files to the training/test directories after you have started your streaming application? Especially the test directory, as you are printing your predictions. On Fri, Jun 24, 2016 at 2:32 PM, Biplob Biswas wrote: > > Hi, > > I
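[Editor's note: for context, a minimal Scala sketch of the pipeline being discussed, adapted from the Spark docs' StreamingKMeans example (directory paths are placeholders). The point Jayant is making: textFileStream only picks up files created after the context starts, so pre-existing files yield only empty timestamped batches.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansExample")
val ssc = new StreamingContext(conf, Seconds(10))

// textFileStream only processes files created in these directories
// AFTER ssc.start(); files that were already there are skipped, which
// is why the output shows nothing but timestamps.
val trainingData = ssc.textFileStream("/streaming/training").map(Vectors.parse)
val testData = ssc.textFileStream("/streaming/test").map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(3, 0.0) // 3-dimensional data, initial weight 0.0

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
```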

Poor performance of using spark sql over gzipped json files

2016-06-24 Thread Shuai Lin
Hi, We have tried to use Spark SQL to process some gzipped JSON-format log files stored on S3 or HDFS, but the performance is very poor. For example, here is the code that I run over 20 gzipped files (4GB compressed in total, ~40GB decompressed): gzfile =
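[Editor's note: a likely factor is that gzip is not a splittable codec, so each .gz file is read and decompressed by a single task regardless of size. A hedged sketch of the usual mitigation (paths and partition counts are placeholders):]

```scala
// gzip is not splittable: 20 files means at most 20 read tasks.
// Repartitioning right after the load spreads the decompressed rows
// across the cluster before any expensive SQL work runs.
val df = sqlContext.read.json("hdfs:///logs/*.gz")
val spread = df.repartition(200) // roughly 2-3x total executor cores
spread.registerTempTable("logs")
sqlContext.sql("SELECT COUNT(*) FROM logs").show()
```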

Re: How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu
SOLVED. The rawPredictionCol input to BinaryClassificationEvaluator is a vector specifying the prediction confidence for each class. Since we are talking about binary classification, the prediction for class 0 is simply (1 - y_pred), where y_pred is the prediction for class 1. So this can be
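[Editor's note: the fix apu describes, wrapping the scalar class-1 probability into the two-element vector the evaluator expects, looks roughly like this in Scala (the pyspark version is analogous; the `predictions` DataFrame and the `y_pred`/`label` column names are assumptions, not from the thread):]

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Element 0 is the confidence for class 0, element 1 for class 1.
val toRawPrediction = udf((yPred: Double) => Vectors.dense(1.0 - yPred, yPred))

val scored = predictions.withColumn("rawPrediction",
  toRawPrediction(predictions("y_pred")))

val auc = new BinaryClassificationEvaluator()
  .setRawPredictionCol("rawPrediction")
  .setLabelCol("label")
  .evaluate(scored) // areaUnderROC by default
```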

Spark 2.0 Continuous Processing

2016-06-24 Thread kmat
Is there a way to checkpoint sink(s) to facilitate rewind processing from a specific offset? For example, for a continuous query aggregated by month: in the 10th month, we would like to re-compute the information between the 4th and 8th months.

Batch details are missing

2016-06-24 Thread C. Josephson
We're trying to resolve some performance issues with spark streaming using the application UI, but the batch details page doesn't seem to be working. When I click on a batch in the streaming application UI, I expect to see something like this: http://i.stack.imgur.com/ApF8z.png But instead we see

Running JavaBased Implementation of StreamingKmeans Spark

2016-06-24 Thread Biplob Biswas
Hi, I implemented the streamingKmeans example provided on the Spark website, but in Java. The full implementation is here: http://pastebin.com/CJQfWNvk But I am not getting anything in the output except occasional timestamps like the one below: ---

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
I was able to resolve the serialization issue. The root cause was that I was accessing the config values within foreachRDD{}. The solution was to extract the values from the config outside the foreachRDD scope and pass the values into the loop directly. Probably something obvious, as we cannot have nested
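[Editor's note: a hedged sketch of the pattern Sunita describes; the config keys, `config`, `stream`, and the filter logic are all illustrative, not from the thread:]

```scala
// A config object (e.g. Typesafe Config) is typically not serializable,
// so referencing it inside foreachRDD drags it into the executor
// closure and fails. Extract plain values on the driver first;
// Strings and Ints serialize fine.
val topic     = config.getString("kafka.topic")
val threshold = config.getInt("filter.threshold")

stream.foreachRDD { rdd =>
  // Only the extracted primitives are captured here, not `config`.
  rdd.filter(_.length > threshold)
     .foreach(record => println(s"[$topic] $record"))
}
```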

Re: Logging trait in Spark 2.0

2016-06-24 Thread Jonathan Kelly
Ted, how is that thread related to Paolo's question? On Fri, Jun 24, 2016 at 1:50 PM Ted Yu wrote: > See this related thread: > > > http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+ > > On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno

Model Quality Tracking

2016-06-24 Thread Benjamin Kim
Has anyone implemented a way to track the performance of a data model? We currently have an algorithm to do record linkage and spit out statistics of matches, non-matches, and/or partial matches with reason codes of why we didn’t match accurately. In this way, we will know if something goes

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Xinh Huynh
Hi Martin, Since your schema is dynamic, how would you use Datasets? Would you know ahead of time the row type T in a Dataset[T]? One option is to start with DataFrames in the beginning of your data pipeline, figure out the field types, and then switch completely over to RDDs or Dataset in the

Re: Logging trait in Spark 2.0

2016-06-24 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+ On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno wrote: > Hi, > > developing a Spark Streaming custom receiver I noticed that the Logging > trait isn't accessible

Unsubscribe

2016-06-24 Thread R. Revert
On Jun 24, 2016 1:55 PM, "Steve Florence" wrote: > >

Unsubscribe

2016-06-24 Thread Steve Florence

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano
Indeed. But I'm dealing with 1.6 for now, unfortunately. On 06/24/2016 02:30 PM, Ted Yu wrote: In Spark 2.0, Dataset and DataFrame are unified. Would this simplify your use case? On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano wrote: Hi, I'm

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Ted Yu
In Spark 2.0, Dataset and DataFrame are unified. Would this simplify your use case? On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano wrote: > Hi, > > I'm exposing a custom source to the Spark environment. I have a question > about the best way to approach this problem. > >

How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu
pyspark.ml.evaluation.BinaryClassificationEvaluator expects predictions in the form of vectors (apparently designating confidence intervals), as described in https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator However, I am trying to

Spark connecting to Hive in another EMR cluster

2016-06-24 Thread Dave Maughan
Hi, We're trying to get a Spark (1.6.1) job running on EMR (4.7.1) that's connecting to the Hive metastore in another EMR cluster. A simplification of what we're doing is below: val sparkConf = new SparkConf().setAppName("MyApp") val sc = new SparkContext(sparkConf) val sqlContext = new
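[Editor's note: one common way to point a Spark 1.6 job at a remote metastore is to set hive.metastore.uris on the HiveContext; a sketch only, with a placeholder hostname (9083 is the default metastore thrift port):]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)

// Point at the other EMR cluster's metastore instead of the local one.
sqlContext.setConf("hive.metastore.uris", "thrift://other-emr-master:9083")
sqlContext.sql("SHOW TABLES").show()
```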

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
Unless I'm misreading the image you posted, it does show event counts for the single batch that is still running, with 1.7 billion events in it. The recent batches do show 0 events, but I'm guessing that's because they're actually empty. When you said you had a kafka topic with 1.7 billion

DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano
Hi, I'm exposing a custom source to the Spark environment. I have a question about the best way to approach this problem. I created a custom relation for my source and it creates a DataFrame. My custom source knows the data types, which are dynamic, so this seemed to be the appropriate return

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem. You should not have to include the kafka_2.10 artifact in your pom; spark-streaming-kafka_2.10 already has a transitive dependency on it. That being said, 0.8.2.1 is the correct version, so that's a little strange. How are you building and submitting your
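[Editor's note: the thread concerns a Maven pom; the sbt equivalent of the dependency setup Cody describes would be something like the following, with the 1.6.1 version assumed from the thread's context (spark-streaming-kafka 1.6.x pulls in kafka 0.8.2.1 transitively):]

```scala
// build.sbt -- spark-streaming-kafka brings kafka_2.10 in transitively;
// declaring kafka yourself risks putting a conflicting version on the
// classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
  // do NOT also add "org.apache.kafka" %% "kafka"
)
```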

streaming on yarn

2016-06-24 Thread Alex Dzhagriev
Hello, Can someone please share their opinions on the options available for running Spark streaming jobs on YARN? The first thing that comes to my mind is to use Slider. Googling for such experience didn't give me much. From my experience running the same jobs on Mesos, I have two concerns: automatic

Logging trait in Spark 2.0

2016-06-24 Thread Paolo Patierno
Hi, developing a Spark Streaming custom receiver I noticed that the Logging trait isn't accessible anymore in Spark 2.0: "trait Logging in package internal cannot be accessed in package org.apache.spark.internal". For developing a custom receiver, what is the preferred way of logging? Just
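[Editor's note: since org.apache.spark.internal.Logging is private to Spark in 2.0, one common workaround is to declare your own SLF4J logger in the receiver. A sketch under that assumption, not an official replacement:]

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.slf4j.{Logger, LoggerFactory}

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  // @transient + lazy: the logger is never serialized with the receiver
  // and is re-created wherever the receiver actually runs.
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  override def onStart(): Unit = log.info("receiver starting")
  override def onStop(): Unit  = log.info("receiver stopping")
}
```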

Re: Performance issue with spark ml model to make single predictions on server side

2016-06-24 Thread Nick Pentreath
Currently, spark-ml models and pipelines are only usable in Spark. This means you must use Spark's machinery (and pull in all its dependencies) to do model serving. Also currently there is no fast "predict" method for a single Vector instance. So for now, you are best off going with PMML, or

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
Yes yes, true. I just wonder if somebody took measurements for all the different types of problems in the Big Data area and created some scientific analysis of how much time is wasted on serialization/deserialization, to support the figure of 80% ;) > On 24 Jun 2016, at 10:35, Jacek Laskowski

Spark Xml schema help

2016-06-24 Thread Nandan Thakur
Hi All, I have been using spark-xml in one of my projects to process some XML files, but when I don't provide a custom schema, this jar automatically generates the following schema:- root |-- UserData: struct (nullable = true) | |-- UserValue: array
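[Editor's note: supplying your own StructType to the reader overrides spark-xml's inference. A sketch matching the UserData/UserValue shape above; the leaf types, rowTag, and path are assumptions:]

```scala
import org.apache.spark.sql.types._

// Hand-built schema matching the inferred shape; adjust the leaf types
// to whatever the XML actually contains.
val schema = StructType(Seq(
  StructField("UserData", StructType(Seq(
    StructField("UserValue", ArrayType(StringType), nullable = true)
  )), nullable = true)
))

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Record")       // placeholder row tag
  .schema(schema)                   // skip inference, use ours
  .load("hdfs:///data/users.xml")   // placeholder path
```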

Re: problem running spark with yarn-client not using spark-submit

2016-06-24 Thread Mich Talebzadeh
Hi, regarding "Trying to run spark with yarn-client not using spark-submit here": what are you using to submit the job? spark-shell, spark-sql or anything else? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

How to convert a Random Forest model built in R to a similar model in Spark

2016-06-24 Thread Neha Mehta
Hi Sun, I am trying to build a model in Spark. Here are the parameters which were used in R for creating the model; I am unable to figure out how to specify a similar input to the random forest regressor in Spark so that I get a similar model in Spark. https

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Mich Talebzadeh
Hi Puneet, "File does not exist: hdfs://localhost:8020/user/opc/.sparkStaging/application_1466711725829_0033/pipeline-lib-0.1.0-SNAPSHOT.jar" indicates a YARN issue. It is trying to get that file from HDFS and copy it across to the /tmp directory. 1. Check that the class is actually created at

problem running spark with yarn-client not using spark-submit

2016-06-24 Thread sychungd
Hello guys, Trying to run Spark with yarn-client, not using spark-submit, here, but the jobs kept failing while the AM was launching the executor. The error collected by YARN is below. Looks like some environment setting is missing? Could someone help me out with this? Thanks in advance! HY Chung Java

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang
You might have multiple java servlet jars on your classpath. On Fri, Jun 24, 2016 at 3:31 PM, Mich Talebzadeh wrote: > can you please check the yarn log files to see what they say (both the > nodemamager and resourcemanager) > > HTH > > Dr Mich Talebzadeh > > > >

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Pranav Nakhe
Hello, The question came from the point that DataFrames use the Tungsten improvements along with the Catalyst optimizer. So there would be some additional work Spark does to convert an RDD to a DataFrame to use the optimizations/improvements available to DataFrames. Regards, Pranav On Fri, Jun 24,

Re: Spark SQL NoSuchMethodException...DriverWrapper.()

2016-06-24 Thread Jacek Laskowski
Hi Mirko, What exactly was the setting? I'd like to reproduce it. Can you file an issue in JIRA to fix that? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri,

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-24 Thread Radha krishna
Hi, I have some files in HDFS with FS as the field separator and RS as the record separator. I am able to read the files and process them successfully. How can I write the Spark DataFrame result into an HDFS file with the same delimiters (FS as field separator and RS as record separator, instead of \n)?
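[Editor's note: a hedged sketch of one approach, dropping to the underlying RDD and joining fields by hand; the output path is a placeholder. Caveat: saveAsTextFile still appends \n after each record, so a fully \n-free file would need a custom Hadoop OutputFormat instead:]

```scala
val FS = "\u001C" // ASCII File Separator
val RS = "\u001E" // ASCII Record Separator

df.rdd
  .map(row => row.mkString(FS) + RS) // join fields with FS, append RS
  .saveAsTextFile("hdfs:///out/records")
```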

Re: Spark SQL NoSuchMethodException...DriverWrapper.()

2016-06-24 Thread Mirko
Hi, Many thanks for the suggestions. I discovered that the problem was related to a missing driver definition in the jdbc options map. The error wasn’t really helpful for understanding that! Cheers, Mirko On 22 Jun 2016, at 18:11, markcitizen [via Apache Spark User List]

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jacek Laskowski
Hi Jörn, You can measure the time for ser/deser yourself using the web UI or SparkListeners. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jun 24, 2016 at

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
I would push the Spark people to provide equivalent functionality. In the end it is a deserialization/serialization process which should not be done back and forth, because it is one of the more costly aspects during processing. It needs to convert Java objects to a binary representation. It is

Re: categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
I want to add some information about how I created this dict. I followed these steps: 1. I created lists of all my variables: a list of doubles, a list of ints, and a list of categorical variables (all the categorical variables are int-typed). 2. I created al = listdouble + listint + listcateg: command

categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
Hi, how can I keep the type of my variable as int? I get this error when I call the random forest algorithm with model = RandomForest.trainClassifier(rdf, numClasses=2, categoricalFeaturesInfo=d,
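[Editor's note: for reference, the Scala form of the same call. categoricalFeaturesInfo maps a feature index to its number of distinct categories, and each categorical feature must already be encoded as 0.0, 1.0, ...; the indices, category counts, and tuning values below are made up:]

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// Feature 0 has 3 categories, feature 4 has 7; everything else is
// treated as continuous. maxBins must be >= the largest category count.
val categoricalFeaturesInfo = Map(0 -> 3, 4 -> 7)

def train(data: RDD[LabeledPoint]) =
  RandomForest.trainClassifier(
    input = data,
    numClasses = 2,
    categoricalFeaturesInfo = categoricalFeaturesInfo,
    numTrees = 50,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32,
    seed = 42)
```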

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Mich Talebzadeh
Can you please check the YARN log files to see what they say (both the nodemanager and the resourcemanager)? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Mich Talebzadeh
Hi, I do not profess at all that this reply has any correlation with the advanced people :) However, in general a DataFrame adds a two-dimensional structure (table) to an RDD, which is basically a construct that cannot be optimised due to the schema-less nature of an RDD. Now converting an RDD to

Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread puneet kumar
I am getting the below error when I submit a Spark job using spark-submit on YARN. Need quick help on what's going wrong here. 16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5: java.lang.IllegalStateException: class

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
Thanks, but the whole point is not setting it explicitly; it should be derived from its parent RDDs. Thanks On Fri, Jun 24, 2016 at 6:09 AM, ayan guha wrote: > You can change parallelism like the following: > > conf = SparkConf() > conf.set('spark.sql.shuffle.partitions',10)

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jacek Laskowski
Hi, I've been asking a similar question myself too! Thanks for sending it to the mailing list! Going from an RDD to a Dataset triggers a job to calculate a schema (unless the RDD is RDD[Row]). I *think* that transitioning from a Dataset to an RDD is almost a no-op since a Dataset requires more to
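[Editor's note: Jacek's point in miniature. Going RDD-to-DataFrame is where the schema work happens (for a case-class RDD it is derived by reflection; untyped RDDs may need a scan), while DataFrame-to-RDD essentially just exposes the underlying RDD[Row]. A 1.6-era sketch with made-up data:]

```scala
import sqlContext.implicits._

case class Event(id: Long, value: Double)

val rdd = sc.parallelize(Seq(Event(1L, 0.5), Event(2L, 1.5)))

// RDD -> DataFrame: schema derived from the case class via reflection.
val df = rdd.toDF()

// DataFrame -> RDD: no shuffle, no extra job; just unwraps to RDD[Row].
val rows = df.rdd
```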

Cost of converting RDD's to dataframe and back

2016-06-24 Thread pan
Hello, I am trying to understand the cost of converting an RDD to a DataFrame and back. Would converting back and forth very frequently cost performance? I do observe that some operations like join are implemented very differently for RDDs (pair) and DataFrames, so I am trying to figure out the cost