Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
One thing you should be aware of (that's a showstopper for my use cases, but may not be for yours) is that you can provide Kafka offsets to start from, but you can't really get access to offsets and metadata during the job on a per-batch or per-partition basis, just on a per-message basis. On
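For context, a minimal sketch of what per-message offset access looks like with the Structured Streaming Kafka source (Scala; assumes the spark-sql-kafka-0-10 artifact is on the classpath, and the broker/topic names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-offsets-sketch").getOrCreate()

    // Starting offsets can be supplied up front via options (exact option values
    // depend on the Spark/connector version)...
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "events")                         // placeholder topic
      .option("startingOffsets", "earliest")
      .load()

    // ...but once the stream is running, offsets and metadata are only visible
    // per message, as ordinary columns on each row.
    val withMeta = df.selectExpr("topic", "partition", "offset", "CAST(value AS STRING) AS value")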

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
Yeah, those are all requests for additional features / version support. I've been using Kafka with structured streaming to do both ETL into partitioned parquet tables and streaming event-time windowed aggregation for several weeks now. On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger
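As a rough illustration of the two patterns mentioned (not the author's actual jobs), a hedged Scala sketch of a Kafka source feeding an event-time windowed aggregation; broker, topic, and sink choices are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("kafka-window-sketch").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
      .option("subscribe", "events")                         // placeholder
      .load()
      .select($"timestamp", $"value".cast("string").as("payload"))

    // Event-time windowed aggregation: count events per 10-minute window.
    val counts = events
      .groupBy(window($"timestamp", "10 minutes"))
      .count()

    // A console sink keeps the sketch self-contained; the ETL case would instead
    // write to a (partitioned) parquet path with a checkpoint location.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()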

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
Look at the resolved subtasks attached to that ticket you linked. Some of them are unresolved, but basic functionality is there. On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande wrote: > Hi Michael, > > Thanks for the reply. > > The following link says there is a open

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread shyla deshpande
Hi Michael, Thanks for the reply. The following link says there is an open, unresolved JIRA for Structured Streaming support for consuming from Kafka: https://issues.apache.org/jira/browse/SPARK-15406 Appreciate your help. -Shyla On Tue, Nov 1, 2016 at 5:19 PM, Michael Armbrust

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
I'm not aware of any open issues against the kafka source for structured streaming. On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande wrote: > I am building a data pipeline using Kafka, Spark streaming and Cassandra. > Wondering if the issues with Kafka source fixed in

Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread shyla deshpande
I am building a data pipeline using Kafka, Spark Streaming and Cassandra. Wondering if the issues with the Kafka source are fixed in Spark 2.0.1. If not, please give me an update on when they may be fixed. Thanks -Shyla

not able to connect to table using hiveContext

2016-11-01 Thread vinay parekar
Hi there, I am trying to get some table data using spark hiveContext. I am getting an exception: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table rnow_imports_text. null at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1158) at

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Cool! So going back to the IDF Estimator and Model problem, do you know what an IDF estimator really does during the fitting process? It must be storing some state (information) as I mentioned in the OP (|D|, DF(t, D), and perhaps TF(t, D)) that it re-uses to transform test data (labeled data). Or does it

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Yes, that is correct. I think I misread a part of it in terms of scoring. I think we both are saying the same thing, so that's a good thing :) On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel wrote: > Hi Ayan, > > "classification algorithm will for sure need to Fit against new

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan, "classification algorithm will for sure need to Fit against new dataset to produce new model" I said this in the context of re-training the model. Is it not correct? Isn't it part of re-training? Thanks On Tue, Nov 1, 2016 at 4:01 PM, ayan guha wrote: > Hi > >

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Hi "classification algorithm will for sure need to Fit against new dataset to produce new model" - I do not think this is correct. Maybe we are talking semantics but AFAIU, you "train" one model using some dataset, and then use it for scoring new datasets. You may re-train every month, yes. And

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name (a string) to a logical query plan. Fragments of that logical query plan may or may not be cached (but calling register alone will not result in any materialization of results). In Spark 2.0 we renamed this function to
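A minimal sketch of the distinction being made (Spark 2.0 APIs; the table name and path are placeholders): registering a temp view only records a name-to-plan mapping, while caching is a separate, explicit step.

    val df = spark.read.parquet("/path/to/data")      // placeholder input

    df.createOrReplaceTempView("my_table")             // the Spark 2.0 name for registerTempTable
    // Nothing is materialized yet; caching is explicit and still lazy:
    spark.catalog.cacheTable("my_table")
    spark.sql("SELECT count(*) FROM my_table").show()  // first action populates the cache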

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Looks like upgrading to Spark 2.0.1 fixed it! The thread count now when I do cat /proc/pid/status is about 84, as opposed to about 1000 in the span of 2 minutes in Spark 2.0.0. On Tue, Nov 1, 2016 at 11:40 AM, Shixiong(Ryan) Zhu wrote: > Yes, try 2.0.1! > > On Tue, Nov 1, 2016

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan, After deployment, we might re-train it every month. That is a whole different problem I haven't explored yet. The classification algorithm will for sure need to fit against the new dataset to produce a new model. Correct me if I am wrong, but I think I will also fit a new IDF model based on the new dataset. At

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Yes, try 2.0.1! On Tue, Nov 1, 2016 at 11:25 AM, kant kodali wrote: > AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 > > On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Dstream "Window" uses "union" to combine multiple

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu wrote: > Dstream "Window" uses "union" to combine multiple RDDs in one window into > a single RDD. > > On Tue, Nov 1, 2016 at 2:59 AM kant kodali

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Dstream "Window" uses "union" to combine multiple RDDs in one window into a single RDD. On Tue, Nov 1, 2016 at 2:59 AM kant kodali wrote: > @Sean It looks like this problem can happen with other RDD's as well. Not > just unionRDD > > On Tue, Nov 1, 2016 at 2:52 AM, kant

Re: Deep learning libraries for scala

2016-11-01 Thread Benjamin Kim
To add, I see that Databricks has been busy integrating deep learning more into their product and put out a new article about this. https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html An

Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX or GraphFrames subgraph functionality to build up your subgraph. A good example for GraphFrames can be found at: http://graphframes.github.io/user-guide.html#subgraphs. HTH! On Mon, Oct 10, 2016 at 9:32 PM cashinpj wrote: > Hello, > > I have a set of data
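A minimal sketch of the subgraph pattern from that user-guide section (illustrative predicates; `g` is assumed to be an existing GraphFrame): filter the vertex and edge DataFrames, then wrap them in a new GraphFrame.

    import org.graphframes.GraphFrame

    // g: an existing GraphFrame with whatever vertex/edge columns your data has
    val v2 = g.vertices.filter("age > 30")               // illustrative vertex predicate
    val e2 = g.edges.filter("relationship = 'friend'")   // illustrative edge predicate
    val sub = GraphFrame(v2, e2)                         // a subgraph built from the filtered DataFrames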

Application remains in WAITING state after Master election

2016-11-01 Thread Alexis Seigneurin
Hi, I am running a Spark standalone cluster with 2 masters: one active, the other in standby. An application is running on this cluster. When the active master dies, the standby master becomes active and the running application reconnects to the newly active master. The only problem I see is

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
If you are using local mode then there is only one JVM. In Linux, mine looks like this:

${SPARK_HOME}/bin/spark-submit \
  --packages ${PACKAGES} \
  --driver-memory 8G \
  --num-executors 1 \
  --executor-memory 8G \
  --master

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Yes, exactly. My (testing) run script is:

spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master local C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Tuesday, November 1, 2016 2:51

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
Are you submitting your job through spark-submit?

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello, This approach unfortunately doesn’t work for job submission for me. It works in the shell, but not when submitted. I ensured the (only worker) node has the desired directory. Neither specifying all jars as you suggested nor using /path/to/jarfiles/* works. Could you verify that using

Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Sean Owen
CrossValidator splits the data into k sets, and then trains k times, holding out one subset for cross-validation each time. You are correct that you should actually withhold an additional test set, before you use CrossValidator, in order to get an unbiased estimate of the best model's performance.
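A hedged sketch of that workflow (illustrative estimator and parameters; `data` is assumed to be a DataFrame with label/features columns): hold out a test set first, cross-validate on the rest, then evaluate the best model on the held-out set.

    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)  // withhold a test set first

    val nb = new NaiveBayes()
    val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
    val grid = new ParamGridBuilder().addGrid(nb.smoothing, Array(0.5, 1.0)).build()

    val cv = new CrossValidator()
      .setEstimator(nb)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val cvModel = cv.fit(train)                                 // k-fold CV on the training portion only
    val f1OnTest = evaluator.evaluate(cvModel.transform(test))  // unbiased estimate on held-out data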

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
You can do that as long as every node has the directory referenced. For example:

spark.driver.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

This will work as long as all nodes have

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Vinod Mangipudi
unsubscribe On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek wrote: > Thank you for the reply. > > I am aware of the parameters used when submitting the tasks (--jars is > working for us). > > > > But, isn’t there any way how to specify a location (directory) for jars > „in

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Thank you for the reply. I am aware of the parameters used when submitting the tasks (--jars is working for us). But isn’t there any way to specify a location (directory) for jars globally, i.e. in spark-defaults.conf? From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday,

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread ayan guha
There are options to specify external jars in the form of --jars, --driver-class-path etc., depending on Spark version and cluster manager. Please see the Spark documentation's configuration section and/or run spark-submit --help to see the available options. On 1 Nov 2016 23:13, "Jan Botorek"

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
I have come across a similar situation recently and decided to run the training workflow less frequently than the scoring workflow. In your use case I would imagine you will run the IDF fit workflow once a week, say. It will produce a model object which will be saved. In the scoring workflow, you will typically
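A hedged sketch of that split between workflows (ML persistence as of Spark 2.0; the paths and the "tf" input column are assumptions):

    import org.apache.spark.ml.feature.{IDF, IDFModel}

    // Training workflow (run infrequently): fit and persist the model.
    val idfModel: IDFModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(trainingTf)
    idfModel.write.overwrite().save("/models/idf")   // placeholder path

    // Scoring workflow (run frequently): load the saved model and apply it.
    val loaded = IDFModel.load("/models/idf")
    val scored = loaded.transform(newBatchTf)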

Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello, I have a problem trying to add jar files to be available on the classpath when submitting a task to Spark. In my spark-defaults.conf file I have the configuration:

spark.driver.extraClassPath = path/to/folder/with/jars

All jars in the folder are available in spark-shell. The problem is that jars

Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Nirav Patel
I am running a classification model. With a normal training-test split I can check model accuracy and F1 score using MulticlassClassificationEvaluator. How can I do this with the CrossValidation approach? AFAIK, you fit the entire sample data in CrossValidator as you don't want to leave out any observation

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Yes, I do apply NaiveBayes after IDF. "you can re-train (fit) on all your data before applying it to unseen data." Did you mean I can reuse that model to transform both training and test data? Here's the process: Datasets: 1. Full sample data (labeled) 2. Training (labeled) 3. Test

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
Fit it on training data to evaluate the model. You can either use that model to apply to unseen data or you can re-train (fit) on all your data before applying it to unseen data. fit and transform are 2 different things: fit creates a model, transform applies a model to data to create
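A minimal sketch of that fit/transform split for the IDF case ("tf" is an assumed term-frequency vector column produced upstream, e.g. by HashingTF):

    import org.apache.spark.ml.feature.IDF

    val idf = new IDF().setInputCol("tf").setOutputCol("tfidf")
    val idfModel = idf.fit(trainTf)               // fit: computes document frequencies from training data only

    val trainVecs = idfModel.transform(trainTf)   // transform: applies the fitted weights
    val testVecs  = idfModel.transform(testTf)    // same model reused; no re-fitting on test data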

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Just to re-iterate what you said, I should fit the IDF model only on training data and then re-use it both for test data and, later, for unseen data to make predictions. On Tue, Nov 1, 2016 at 3:49 AM, Robin East wrote: > The point of setting aside a portion of your data

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
It would be great if we establish this. I know in Hive these temporary tables ("CREATE TEMPORARY TABLE ...") are private to the session and are put in a hidden staging directory, as below: /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10 and removed when the

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this. According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html and I quote: > "registerTempTable() creates an in-memory table that is scoped to the cluster in which

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
The point of setting aside a portion of your data as a test set is to try and mimic applying your model to unseen data. If you fit your IDF model to all your data, any evaluation you perform on your test set is likely to over perform compared to ‘real’ unseen data. Effectively you would have

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
A bit of a gray area here, I am afraid; I was trying to experiment with it. According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html and I quote: "registerTempTable() creates an in-memory table that is scoped to the cluster

Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
FYI, I do reuse the IDF model while making predictions against new unlabeled data, but not between training and test data while training a model. On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel wrote: > I am using IDF estimator/model (TF-IDF) to convert text features into >

Is IDF model reusable

2016-11-01 Thread Nirav Patel
I am using the IDF estimator/model (TF-IDF) to convert text features into vectors. Currently, I fit the IDF model on all sample data and then transform them. I read somewhere that I should split my data into training and test before fitting the IDF model; fit IDF only on training data and then use the same

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich,

"But the thing is that I don't explicitly cache the tempTables ...".

> I believe tempTable is created in-memory and is already cached

That surprises me, since there is a sqlContext.cacheTable method to explicitly cache a table in memory. Or am I missing something? This could

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
@Sean It looks like this problem can happen with other RDDs as well, not just UnionRDD. On Tue, Nov 1, 2016 at 2:52 AM, kant kodali wrote: > Hi Sean, > > The comments seem very relevant although I am not sure if this pull > request https://github.com/apache/spark/pull/14985

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Hi Sean, The comments seem very relevant, although I am not sure if this pull request https://github.com/apache/spark/pull/14985 would fix my issue. I am not sure whether UnionRDD.scala has anything to do with my error (I don't know much about the Spark code base). Do I ever use UnionRDD.scala when I

Re: Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
Sorry: Spark 2.0.0 On Tue, Nov 1, 2016 at 10:04 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > I've been getting pretty serious with DC/OS which I guess could be > described as a somewhat polished distribution of Mesos. I'm not sure how > relevant DC/OS is to this

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Sean Owen
Possibly https://issues.apache.org/jira/browse/SPARK-17396 ? On Tue, Nov 1, 2016 at 2:11 AM kant kodali wrote: > Hi Ryan, > > I think you are right. This may not be related to the Receiver. I have > attached jstack dump here. I do a simple MapToPair and reduceByKey and I >

Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
Hello, I've been getting pretty serious with DC/OS which I guess could be described as a somewhat polished distribution of Mesos. I'm not sure how relevant DC/OS is to this problem. I am using this pyspark program to test the cassandra connection: http://bit.ly/2eWAfxm (github) I can see that the

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
This question looks very similar to mine but I don't see any answer. http://markmail.org/message/kkxhi5jjtwyadzxt On Mon, Oct 31, 2016 at 11:24 PM, kant kodali wrote: > Here is a UI of my thread dump. > > http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYv >

Addition of two SparseVector

2016-11-01 Thread Yan Facai
Hi, all. How can I add a Vector to another one?

scala> val a = Vectors.sparse(20, Seq((1,1.0), (2,2.0)))
a: org.apache.spark.ml.linalg.Vector = (20,[1,2],[1.0,2.0])

scala> val b = Vectors.sparse(20, Seq((2,2.0), (3,3.0)))
b: org.apache.spark.ml.linalg.Vector = (20,[2,3],[2.0,3.0])

scala> a + b
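There is no '+' defined on ml.linalg vectors; one possible helper (a sketch written against the public Vector/Vectors API) sums two vectors element-wise and keeps the result sparse:

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    def addVectors(a: Vector, b: Vector): Vector = {
      require(a.size == b.size, "vectors must have the same dimension")
      val sums = a.toArray                          // dense working copy of a
      b.foreachActive { (i, v) => sums(i) += v }    // add b's non-zero entries
      // keep only non-zero entries so the result stays sparse
      Vectors.sparse(a.size, sums.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) })
    }

    // For the vectors above: addVectors(a, b) == (20,[1,2,3],[1.0,4.0,3.0])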

Spark Job Failed with FileNotFoundException

2016-11-01 Thread fanooos
I have a Spark cluster consisting of 5 nodes, and I have a Spark job that should process some files from a directory and send their content to Kafka. I am trying to submit the job using the following command: bin$ ./spark-submit --total-executor-cores 20 --executor-memory 5G --class

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Here is a UI of my thread dump. http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng== On Mon, Oct 31, 2016 at 7:10 PM, kant kodali wrote: > Hi Ryan, > > I think you are

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-11-01 Thread kant kodali
Here is a UI of my thread dump. http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng== On Mon, Oct 31, 2016 at 10:32 PM, kant kodali wrote: > Hi Vadim, > > Thank you so