Re: PySpark Pandas UDF

2019-11-17 Thread Bryan Cutler
There was a change in the Arrow binary format in the 0.15 release, and there is an environment variable you can set to make pyarrow 0.15.1 compatible with current Spark, which looks to be your problem. Please see the doc below for instructions added in SPARK-2936. Note, this will not be required for the
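A minimal sketch of that workaround, assuming the legacy-IPC-format variable documented for Spark 2.3.x/2.4.x with pyarrow >= 0.15.0; it must be visible to both the driver and the executors (e.g. via conf/spark-env.sh on a cluster):

    # Restore the pre-0.15 Arrow IPC stream format so newer pyarrow can talk
    # to older Spark; set this before the JVM and Python workers start.
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"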

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-12 Thread Bryan Cutler
, 2019 at 6:08 PM Hyukjin Kwon wrote: > +1 > On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote: >> Sounds reasonable to me. We should make the behavior consistent within Spark.

[DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-04 Thread Bryan Cutler
Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has caused a lot of confusion among users because it is not obvious (although it is stated in the pydocs) that they will be sorted alphabetically. Then later when applying a schema and the
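The behavior in question, in miniature (Spark < 3.0): the field order differs from the call order because keyword arguments are sorted.

    from pyspark.sql import Row

    row = Row(name="Alice", age=11)
    print(row)  # Row(age=11, name='Alice') -- 'age' sorted ahead of 'name'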

Re: question about pyarrow.Table to pyspark.DataFrame conversion

2019-09-10 Thread Bryan Cutler
Hi Artem, I don't believe this is currently possible, but it could be a great addition to PySpark since this would offer a convenient and efficient way to parallelize nested column data. I created the JIRA https://issues.apache.org/jira/browse/SPARK-29040 for this. On Tue, Aug 27, 2019 at 7:55
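Until something like that lands, one workaround is to round-trip through pandas; a sketch assuming an active SparkSession named spark (it loses the direct-Arrow efficiency but works today):

    import pyarrow as pa

    table = pa.Table.from_pydict({"id": [1, 2], "v": [0.1, 0.2]})
    df = spark.createDataFrame(table.to_pandas())
    df.show()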

Re: Usage of PyArrow in Spark

2019-07-18 Thread Bryan Cutler
It would be possible to use Arrow with regular Python UDFs and avoid pandas, and there would probably be some performance improvement. The difficult part will be to ensure that the data remains consistent in the conversions between Arrow and Python, e.g. timestamps are a bit tricky. Given that we

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Bryan Cutler
+1 and the draft sounds good On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > Here is the draft announcement: > > === > Plan for dropping Python 2 support > > As many of you already knew, Python core development team and many > utilized Python packages like Pandas and NumPy will drop

Re: pySpark - pandas UDF and binaryType

2019-05-02 Thread Bryan Cutler
Hi, BinaryType support was not added until Spark 2.4.0, see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is required, as you saw in the docs. Bryan On Thu, May 2, 2019 at 4:26 AM Nicolas Paris wrote: > Hi all > > I am using pySpark 2.3.0 and pyArrow 0.10.0
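A hedged sketch of a scalar pandas_udf returning BinaryType under those version requirements (Spark >= 2.4.0, pyarrow >= 0.10.0), assuming an active SparkSession named spark; the column name is illustrative:

    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import BinaryType

    @pandas_udf(BinaryType())
    def to_bytes(s):
        # s is a pandas.Series of strings; return a Series of bytes
        return s.map(lambda x: x.encode("utf-8"))

    df = spark.createDataFrame([("a",), ("b",)], ["c"])
    df.select(to_bytes("c")).show()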

Re: spark2.4 arrow enabled true,error log not returned

2019-01-10 Thread Bryan Cutler
Hi, could you please clarify if you are running a YARN cluster when you see this problem? I tried on Spark standalone and could not reproduce. If it's on a YARN cluster, please file a JIRA and I can try to investigate further. Thanks, Bryan On Sat, Dec 15, 2018 at 3:42 AM 李斌松 wrote: >

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-30 Thread Bryan Cutler
Here is a link to the JIRA for adding StructType support for scalar pandas_udf https://issues.apache.org/jira/browse/SPARK-24579 On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi wrote: > Hey Holden, > Thanks for your reply, > > We currently using a python function that produces a

Re: Arrow type issue with Pandas UDF

2018-07-19 Thread Bryan Cutler
Hi Patrick, It looks like it's failing in Scala before it even gets to Python to execute your UDF, which is why it doesn't seem to matter what's in your UDF. Since you are doing a grouped map UDF, maybe your group sizes are too big or skewed? Could you try to reduce the size of your groups by

Re: Pandas UDF for PySpark error. Big Dataset

2018-05-29 Thread Bryan Cutler
Can you share some of the code used, or at least the pandas_udf plus the stack trace? Also, does decreasing your dataset size fix the OOM? On Mon, May 28, 2018, 4:22 PM Traku traku wrote: > Hi. > > I'm trying to use the new feature but I can't use it with a big dataset > (about 5 million rows).

Re: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-18 Thread Bryan Cutler
The example works for me, please check your environment and ensure you are using Spark 2.3.0 where OneHotEncoderEstimator was introduced. On Fri, May 18, 2018 at 12:57 AM, Matteo Cossu wrote: > Hi, > > are you sure Dataset has a method withColumns? > > On 15 May 2018 at
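For reference, minimal usage under Spark 2.3.x, where OneHotEncoderEstimator lives in pyspark.ml.feature (the thread is Java, but the Python form is shorter to sketch; it was renamed back to OneHotEncoder in 3.0). Assumes an active SparkSession named spark:

    from pyspark.ml.feature import OneHotEncoderEstimator

    df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])
    encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"],
                                     outputCols=["categoryVec"])
    encoder.fit(df).transform(df).show()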

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-16 Thread Bryan Cutler
Yes, the workaround is to create multiple StringIndexers as you described. OneHotEncoderEstimator is only in Spark 2.3.0, you will have to use just OneHotEncoder. On Tue, May 15, 2018, 8:40 AM Mina Aslani wrote: > Hi, > > So, what is the workaround? Should I create
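The workaround described above, sketched in PySpark with made-up column names: one StringIndexer per column, chained in a Pipeline ('df' stands in for your DataFrame).

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    cols = ["color", "size"]  # hypothetical categorical columns
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cols]
    pipeline = Pipeline(stages=indexers)
    # indexed = pipeline.fit(df).transform(df)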

Re: [Arrow][Dremio]

2018-05-15 Thread Bryan Cutler
Hi Xavier, Regarding Arrow usage in Spark, using Arrow format to transfer data between Python and Java has been the focus so far because this area stood to benefit the most. It's possible that the scope of Arrow could broaden in the future, but there still needs to be discussions about this.

Re: Spark dataset to byte array over grpc

2018-04-23 Thread Bryan Cutler
Hi Ashwin, This sounds like it might be a good use for Apache Arrow, if you are open to the type of format to exchange. As of Spark 2.3, Dataset has a method "toArrowPayload" that will convert a Dataset of Rows to a byte array in Arrow format, although the API is currently not public. Your

Re: PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Bryan Cutler
Hi Aakash, First you will want to get the random forest model stage from the best pipeline model result, for example if RF is the first stage: rfModel = model.bestModel.stages[0] Then you can check the values of the params you tuned like this: rfModel.getNumTrees On Mon, Apr 16, 2018 at
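Putting the two lines together; 'model' is assumed to be the TrainValidationSplitModel returned by fit(), with the random forest at stage 0:

    rfModel = model.bestModel.stages[0]
    print(rfModel.getNumTrees)
    # other tuned params can be read back too, e.g.:
    # print(rfModel.getOrDefault("maxDepth"))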

Re: is there a way of register python UDF using java API?

2018-04-02 Thread Bryan Cutler
Hi Kant, The udfDeterministic would be set to false if the results from your UDF are non-deterministic, such as produced by random numbers, so the catalyst optimizer will not cache and reuse results. On Mon, Apr 2, 2018 at 12:11 PM, kant kodali wrote: > Looks like there is

Re: PySpark Tweedie GLM

2018-02-09 Thread Bryan Cutler
Can you provide some code/data to reproduce the problem? On Fri, Feb 9, 2018 at 9:42 AM, nhamwey wrote: > I am using Spark 2.2.0 through Python. > > I am repeatedly getting a zero weight of sums error when trying to run a > model. This happens even when I do not

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-02-08 Thread Bryan Cutler
e, the demo code spark-2.2.1-bin-hadoop2.7/examples/src/main/python/ml/estimator_transformer_param_example.py > returns empty parameters when printing "lr.extractParamMap()" > That's weird > Thanks > On Jan 30, 2018 at 11:10 PM, Bryan Cutler wrote:

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-30 Thread Bryan Cutler
Hi Michelle, Your original usage of ParamGridBuilder was not quite right, `addGrid` expects (some parameter, array of values for that parameter). If you want to do a grid search with different regularization values, you would do the following: val paramMaps = new
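The same pattern in PySpark (the thread's code is Scala; this is just an illustrative translation): each addGrid call takes a param and an array of candidate values.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import ParamGridBuilder

    lr = LogisticRegression()
    paramMaps = (ParamGridBuilder()
                 .addGrid(lr.regParam, [0.01, 0.1, 1.0])  # param, array of values
                 .addGrid(lr.maxIter, [10, 100])
                 .build())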

Re: Timestamp changing while writing

2018-01-15 Thread Bryan Cutler
Spark internally stores timestamps as UTC values, so createDataFrame will convert from the local time zone to UTC. I think there was a JIRA to correct parquet output. Are the values you are seeing offset from your local time zone? On Jan 11, 2018 4:49 PM, "sk skk" wrote: >
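One way to make such offsets visible (Spark 2.2+, assuming an active SparkSession named spark): pin the session time zone and compare the values read back.

    # Values are stored as UTC internally; this setting controls how they are
    # converted on the way in and out.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    print(spark.conf.get("spark.sql.session.timeZone"))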

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Bryan Cutler
Hi Prem, Spark actually does somewhat support different algorithms in CrossValidator, but it's not really obvious. You basically need to make a Pipeline and build a ParamGrid with different algorithms as stages. Here is a simple example: val dt = new DecisionTreeClassifier()
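A hedged PySpark rendering of the same trick (the thread's example is Scala): the whole stage list goes into the grid, so each ParamMap swaps in a different algorithm.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier, LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    dt = DecisionTreeClassifier()
    lr = LogisticRegression()
    pipeline = Pipeline()
    grid = (ParamGridBuilder()
            .addGrid(pipeline.stages, [[dt], [lr]])  # the stages themselves are the grid
            .build())
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator())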

Re: Crossvalidator after fit

2017-05-05 Thread Bryan Cutler
Looks like there might be a problem with the way you specified your parameter values, probably you have an integer value where it should be a floating-point. Double check that and if there is still a problem please share the rest of your code so we can see how you defined "gridS". On Fri, May 5,

Re: pandas DF Dstream to Spark DF

2017-04-10 Thread Bryan Cutler
Hi Yogesh, It would be easier to help if you included your code and the exact error messages that occur. If you are creating a Spark DataFrame from a Pandas DataFrame, Spark does not read the Pandas schema; it infers one from the data. This might be the cause of your issue if the schema
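When inference guesses wrong, passing an explicit schema sidesteps it; a small sketch with illustrative column names, assuming an active SparkSession named spark:

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    pdf = pd.DataFrame({"name": ["a"], "count": [1]})
    schema = StructType([StructField("name", StringType()),
                         StructField("count", LongType())])
    df = spark.createDataFrame(pdf, schema=schema)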

Re: Belief propagation algorithm is open sourced

2016-12-14 Thread Bryan Cutler
I'll check it out, thanks for sharing Alexander! On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" wrote: > Dear Spark developers and users, > > > HPE has open sourced the implementation of the belief propagation (BP) > algorithm for Apache Spark, a popular message passing

Re: New to spark.

2016-09-28 Thread Bryan Cutler
Hi Anirudh, All types of contributions are welcome, from code to documentation. Please check out the page at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for some info, specifically keep a watch out for starter JIRAs here

Re: Master OOM in "master-rebuild-ui-thread" while running stream app

2016-09-13 Thread Bryan Cutler
It looks like you have event logging enabled and your application event log is too large for the master to build a web UI from it. In Spark 1.6.2 and earlier, when an application completes, the master rebuilds a web UI to view events after the fact. This functionality was removed in Spark 2.0 and the

Re: Random Forest Classification

2016-08-31 Thread Bryan Cutler
ce it will have neither the "feature" column > nor the 5 indexed columns. > Of course there is a dirty way of doing this, but I am wondering if there is > some optimized/intelligent approach for this. > Thanks, > Baahu > On Wed, Aug 31, 2016 at 3:30 AM, Bryan C

Re: Random Forest Classification

2016-08-30 Thread Bryan Cutler
.setMaxCategories(180) > val decisionTree = new DecisionTreeClassifier().setMaxBins(300).setMaxDepth(1).setImpurity("entropy").setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Bryan Cutler
You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance: cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path") Alternatively, you could save the CV model directly, which takes care of this cvModel.save("/my/path") On

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - predicts cluster assignment on data The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care
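The two-step flow described above, sketched with the pyspark.mllib streaming API; 'trainingStream' and 'testStream' are assumed to be existing DStreams of feature vectors:

    from pyspark.mllib.clustering import StreamingKMeans

    model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 42)
    model.trainOn(trainingStream)              # learn/update the cluster centers
    predictions = model.predictOn(testStream)  # predict cluster assignment on data
    predictions.pprint()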

Re: Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-28 Thread Bryan Cutler
That's the correct fix. I have this done along with a few other Java examples that still use the old MLlib Vectors in this PR thats waiting for review https://github.com/apache/spark/pull/14308 On Jul 28, 2016 5:14 AM, "Robert Goodman" wrote: > I changed import in the sample

Re: Programmatic use of UDFs from Java

2016-07-22 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread. Not sure if there has been any more recent work to support this. http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html On Thu, Jul 21, 2016 at 10:10 AM, Everett

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
ML has a DataFrame-based API, while MLlib is RDD-based and will be deprecated as of Spark 2.0. On Thu, Jul 21, 2016 at 10:41 PM, VG <vlin...@gmail.com> wrote: > Why do we have these 2 packages ... ml and mllib? > What is the difference between these? > On Fri, Jul 22, 2016 a

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
Hi JG, If you didn't know this, Spark MLlib has 2 APIs, one of which uses DataFrames. Take a look at this example https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java This example uses a Dataset, which is

Re: spark-submit local and Akka startup timeouts

2016-07-19 Thread Bryan Cutler

Re: spark-submit local and Akka startup timeouts

2016-07-18 Thread Bryan Cutler
Hi Rory, for starters what version of Spark are you using? I believe that in a 1.5.? release (I don't know which one off the top of my head) there was an addition that would also display the config property when a timeout happened. That might help some if you are able to upgrade. On Jul 18,

Re: Random Forest Classification

2016-07-08 Thread Bryan Cutler
ce in file formats. > 3. I tested this same code on another Spark cloud platform and it displays > the same symptoms when run there. > Thanks. > Rich > On Wed, Jun 29, 2016 at 12:59 AM, Bryan Cutler <cutl...@gmail.com> wrote: >> Are you fitting t

Re: ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Bryan Cutler
Can you try running the example like this ./bin/run-example sql.RDDRelation I know there are some jars in the example folders, and running them this way adds them to the classpath On Jul 7, 2016 3:47 AM, "kevin" wrote: > hi,all: > I build spark use: > >

Re: Set the node the spark driver will be started

2016-06-29 Thread Bryan Cutler
Hi Felix, I think the problem you are describing has been fixed in later versions, check out this JIRA https://issues.apache.org/jira/browse/SPARK-13803 On Wed, Jun 29, 2016 at 9:27 AM, Mich Talebzadeh wrote: > Fine. in standalone mode spark uses its own scheduling

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
Are you fitting the VectorIndexer to the entire data set and not just training or test data? If you are able to post your code and some data to reproduce, that would help in troubleshooting. On Tue, Jun 28, 2016 at 4:40 PM, Rich Tarro wrote: > Thanks for the response, but

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
The problem might be that you are evaluating with "predictionLabel" instead of "prediction", where predictionLabel is the prediction index mapped to the original label strings - at least according to the RandomForestClassifierExample, not sure if your code is exactly the same. On Tue, Jun 28,
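For reference, the evaluator wiring that matches the RandomForestClassifierExample, pointed at the raw "prediction" column (the metric name may differ by Spark version):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    # accuracy = evaluator.evaluate(predictions)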

Re: LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Bryan Cutler
The stack trace you provided seems to hint that you are calling "predict" on an RDD with Vectors that are not the same size as the number of features in your trained model, they should be equal. If that's not the issue, it would be easier to troubleshoot if you could share your code and possibly

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
plication master should run in the yarn conf? I haven't found any useful information regarding that. > Thanks. > On Mon, Jun 6, 2016 at 4:52 PM, Bryan Cutler <cutl...@gmail.com> wrote: >> In that mode, it will run on the application master, whichever node th

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
In that mode, it will run on the application master, whichever node that is as specified in your yarn conf. On Jun 5, 2016 4:54 PM, "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is there any way to specify on which node I want the > driver to run? > > Thanks. >

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Bryan Cutler
This is currently being worked on, planned for 2.1 I believe https://issues.apache.org/jira/browse/SPARK-7159 On May 28, 2016 9:31 PM, "Stephen Boesch" wrote: > Thanks Phuong But the point of my post is how to achieve without using > the deprecated the mllib pacakge. The

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
/scala/org/apache/spark/examples/mllib/RecommendationExample.scala#L62 On Fri, Mar 11, 2016 at 8:18 PM, Shishir Anshuman <shishiranshu...@gmail.com> wrote: > The model produced after training. > On Fri, Mar 11, 2016 at 10:29 PM, Bryan Cutler <cutl...@gmail.com> wrote:

Re: Get output of the ALS algorithm.

2016-03-11 Thread Bryan Cutler
Are you trying to save predictions on a dataset to a file, or the model produced after training with ALS? On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS

Re: LDA topic Modeling spark + python

2016-02-29 Thread Bryan Cutler
orpus = grouped.zipWithIndex().map(lambda (term_counts, doc_id): [doc_id, term_counts]).cache() > #corpus.cache() > model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online") > #ldaModel = LDA.train(corpus, k=3) > print corpus > topic

Re: LDA topic Modeling spark + python

2016-02-25 Thread Bryan Cutler
I'm not exactly sure how you would like to setup your LDA model, but I noticed there was no Python example for LDA in Spark. I created this issue to add it https://issues.apache.org/jira/browse/SPARK-13500. Keep an eye on this if it could be of help. bryan On Wed, Feb 24, 2016 at 8:34 PM,

Re: Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Bryan Cutler
Using flatMap on a string will treat it as a sequence, which is why you are getting an RDD of char. I think you want to just do a map instead, like this: val timestamps = stream.map(event => event.getCreatedAt.toString) On Feb 25, 2016 8:27 AM, "Dominik Safaric" wrote:
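The distinction in miniature (PySpark here, though the thread's code is Scala; 'sc' is an active SparkContext):

    rdd = sc.parallelize(["2016-02-25"])
    rdd.flatMap(lambda s: s).take(4)        # ['2', '0', '1', '6'] -- characters
    rdd.map(lambda s: s.upper()).collect()  # ['2016-02-25'] -- one element per record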

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
>> [4, 1, 3083.2778025] >> [2, 4, 6226.40232139] >> [1, 2, 785.84266] >> [5, 1, 6706.05424139] >> and monitor. Please let me know if I missed something. >> Krishna

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
gence) > But I am seeing the same centers always - ran the app for several hours with a custom receiver. > Yes, I am using the latestModel to predict using "labeled" test data. But I'd also like to know where my centers are. > Regards, > K

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
Could you elaborate where the issue is? You say calling model.latestModel.clusterCenters.foreach(println) doesn't show an updated model, but that is just a single statement to print the centers once. Also, is there any reason you don't predict on the test data like this?

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Bryan Cutler
(quoted environment dump from a YARN container: PYTHONPATH entries pointing at pyspark.zip and py4j-0.9-src.zip under the application cache, plus PYTHONUNBUFFERED=YES)

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
response. Is there any workaround? > From: Bryan Cutler [mailto:cutl...@gmail.com] > Sent: Thursday, January 14, 2016 2:19 PM > To: Rachana Srivastava > Cc: user@spark.apache.org; d...@spark.apache.org > Subject: Re: Random Forest FeatureImportance throwing NullPointerException

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
Hi Rachana, I got the same exception. It is because computing the feature importance depends on impurity stats, which is not calculated with the old RandomForestModel in MLlib. Feel free to create a JIRA for this if you think it is necessary, otherwise I believe this problem will be eventually

Re: SparkContext SyntaxError: invalid syntax

2016-01-13 Thread Bryan Cutler
yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10 > I get the error I mentioned in the prior email: > Error from python worker: > python: module pyspark.daemon not found > Any thoughts?

Re: SparkContext SyntaxError: invalid syntax

2016-01-08 Thread Bryan Cutler
Hi Andrew, I know that older versions of Spark could not run PySpark on YARN in cluster mode. I'm not sure if that is fixed in 1.6.0 though. Can you try setting deploy-mode option to "client" when calling spark-submit? Bryan On Thu, Jan 7, 2016 at 2:39 PM, weineran <

Re: error writing to stdout

2016-01-06 Thread Bryan Cutler
This is a known issue https://issues.apache.org/jira/browse/SPARK-9844. As Noorul said, it is probably safe to ignore as the executor process is already destroyed at this point. On Mon, Dec 21, 2015 at 8:54 PM, Noorul Islam K M wrote: > carlilek

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
llect().toString()); total += rdd.count(); } } MyFunc f = new MyFunc(); inputStream.foreachRDD(f); // f.total will have the count of all RDDs Hope that helps some! -bryan On Wed, Dec 16, 2015 at 8:37 AM, Bryan Cutler <cutl...@gmail.com> wrote: > Hi Andy, > >

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
Hi Andy, Regarding the foreachrdd return value, this Jira that will be in 1.6 should take care of that https://issues.apache.org/jira/browse/SPARK-4557 and make things a little simpler. On Dec 15, 2015 6:55 PM, "Andy Davidson" wrote: > I am writing a JUnit test

Re: ALS mllib.recommendation vs ml.recommendation

2015-12-15 Thread Bryan Cutler
Hi Roberto, 1. How do they differ in terms of performance? They both use alternating least squares matrix factorization; the main difference is that ml.recommendation.ALS takes DataFrames as input, which have built-in optimizations and should give better performance. 2. Am I correct to assume
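Minimal DataFrame-based usage for comparison, assuming an active SparkSession named spark (column names are illustrative):

    from pyspark.ml.recommendation import ALS

    ratings = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)], ["userId", "itemId", "rating"])
    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              rank=10, maxIter=5)
    model = als.fit(ratings)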

Re: SparkStreaming variable scope

2015-12-09 Thread Bryan Cutler
rowid from your code is a variable in the driver, so it will be evaluated once and then only the value is sent to words.map. You probably want to have rowid be a lambda itself, so that it will get the value at the time it is evaluated. For example if I have the following: >>> data =
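A sketch of the two behaviors; 'current_row_id' and 'words' are hypothetical stand-ins for the code under discussion:

    rowid = current_row_id()              # hypothetical helper; evaluated once, on the driver
    words.map(lambda w: (rowid, w))       # every record sees that one value

    rowid_fn = lambda: current_row_id()   # deferred: evaluated when called
    words.map(lambda w: (rowid_fn(), w))  # value obtained at map time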

Re: Working with RDD from Java

2015-11-17 Thread Bryan Cutler
Hi Ivan, Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the topic distributions called javaTopicDistributions() that returns a JavaPairRDD. If you aren't able to upgrade, you can check out the conversion used here

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-07-30 Thread Bryan Cutler
Hi Praveen, The major difference is that RandomForestClassificationModel makes use of the newer spark.ml API, which utilizes ML pipelines. I can't say for certain if they will produce the same exact result for a given dataset, but I believe they should. Bryan On Wed, Jul 29, 2015 at 12:14 PM,

Re: Timeout Error

2015-04-26 Thread Bryan Cutler
I'm not sure what the expected performance should be for this amount of data, but you could try to increase the timeout with the property spark.akka.timeout to see if that helps. Bryan On Sun, Apr 26, 2015 at 6:57 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying to
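Setting that property at context creation, for the pre-2.0, Akka-based releases this thread concerns (the value is in seconds):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.akka.timeout", "300")
    sc = SparkContext(conf=conf)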