Re: Ability to have CountVectorizerModel vocab as empty
Thanks Sean for the quick response. Logged a JIRA: https://issues.apache.org/jira/browse/SPARK-32662. Will send a pull request shortly.

Regards,
Jatin

On Wed, Aug 19, 2020 at 6:58 PM Sean Owen wrote:
> I think that's true. You're welcome to open a pull request / JIRA to
> remove that requirement.
>
> On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri wrote:
> >
> > Hello,
> >
> > This is with regard to https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
> >
> > require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
> >
> > Currently, training `CountVectorizer` on an empty dataset results in the following exception. But it is a perfectly valid use case to send it empty data (or data where minDF filters everything out). HashingTF works fine in such scenarios; CountVectorizer doesn't.
> >
> > Can we remove this constraint? Happy to send a pull request.
> >
> > java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
> >   at scala.Predef$.require(Predef.scala:224)
> >   at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
> >   at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
> >   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
> >   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> >   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> >   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)

--
Jatin Puri
http://jatinpuri.com
Ability to have CountVectorizerModel vocab as empty
Hello,

This is with regard to https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244

require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, training `CountVectorizer` on an empty dataset results in the exception below. But it is a perfectly valid use case to send it empty data (or data where minDF filters everything out). HashingTF works fine in such scenarios; CountVectorizer doesn't.

Can we remove this constraint? Happy to send a pull request.

java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
  at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
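[Editor's note] The failing check can be reproduced without a Spark dependency. This is a minimal Java sketch of the `require` at CountVectorizer.scala#L244; the class name, the helper, and the empty array (standing in for a vocabulary fitted from empty input, or one where minDF filtered every term out) are all illustrative, not Spark API:

```java
public class EmptyVocabRepro {

    // Mirrors Scala's require(cond, msg), which throws
    // IllegalArgumentException("requirement failed: " + msg) when cond is false
    static void requireNonEmpty(String[] vocab) {
        if (!(vocab.length > 0)) {
            throw new IllegalArgumentException(
                "requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.");
        }
    }

    public static void main(String[] args) {
        // Stand-in for the vocabulary CountVectorizer would fit on empty data
        String[] vocab = new String[0];
        try {
            requireNonEmpty(vocab);
        } catch (IllegalArgumentException e) {
            // Same message the mailing-list stack trace shows
            System.out.println(e.getMessage());
        }
    }
}
```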
Re: Java 11 support in Spark 2.5
From this thread (http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27966), it looks like there is no confirmation yet whether Spark 2.5 would have JDK 11 support at all.

Spark 3 will most likely be out soon (tentatively this quarter, as per the mailing list), and Spark 3 is going to have JDK 11 support.

From: Sinha, Breeta (Nokia - IN/Bangalore)
Sent: Thursday, January 2, 2020 12:48 PM
To: user@spark.apache.org
Cc: Rao, Abhishek (Nokia - IN/Bangalore); Imandi, Srinivas (Nokia - IN/Bangalore)
Subject: Java 11 support in Spark 2.5

Hi All,

Wanted to know if Java 11 support is added in Spark 2.5. If so, what is the expected timeline for the Spark 2.5 release?

Kind Regards,
Breeta Sinha
Re: Lightweight pipeline execution for single row
Using FAIR mode, if there is no other way. I think there is a limitation on the number of parallel jobs that Spark can run. Is there a way to run more jobs in parallel? This is alright because this SparkContext would only be used during web service calls.

I looked at the Spark configuration page and tried a few settings, but they didn't seem to work. I am using Spark 2.3.1.

Thanks.

On Sun, Sep 23, 2018 at 6:00 PM Michael Artz wrote:
> Are you using the scheduler in fair mode instead of FIFO mode?
>
> Sent from my iPhone
>
> > On Sep 22, 2018, at 12:58 AM, Jatin Puri wrote:
> >
> > Hi,
> >
> > What tactics can I apply for such a scenario?
> >
> > I have a pipeline of 10 stages, simple text processing. I train the data with the pipeline and, for the fitted data, do some modelling and store the results.
> >
> > I also have a web server, where I receive requests. For each request (a dataframe of a single row), I transform against the same pipeline created above and take the respective action. The problem is: calling Spark for a single row takes less than 1 second, but under higher load, Spark becomes a major bottleneck.
> >
> > One solution I can think of is a Scala re-implementation of the same pipeline, which uses the model generated above to process the requests. But this results in duplication of code and hence maintenance.
> >
> > Is there any way I can call the same pipeline (transform) in a very lightweight manner, just for a single row, so that it works concurrently and Spark does not remain a bottleneck?
> >
> > Thanks,
> > Jatin

--
Jatin Puri
http://jatinpuri.com
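[Editor's note] For context, the FAIR mode mentioned above is enabled with `spark.scheduler.mode=FAIR` plus an allocation file pointed to by `spark.scheduler.allocation.file`. A sketch of such a file follows; the pool name, weight, and minShare values are illustrative, not prescribed:

```
<?xml version="1.0"?>
<!-- fairscheduler.xml: pools for jobs submitted with
     sc.setLocalProperty("spark.scheduler.pool", "serving") -->
<allocations>
  <pool name="serving">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```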
Lightweight pipeline execution for single row
Hi,

What tactics can I apply for such a scenario?

I have a pipeline of 10 stages, simple text processing. I train the data with the pipeline and, for the fitted data, do some modelling and store the results.

I also have a web server, where I receive requests. For each request (a dataframe of a single row), I transform against the same pipeline created above and take the respective action. The problem is: calling Spark for a single row takes less than 1 second, but under higher load, Spark becomes a major bottleneck.

One solution I can think of is a Scala re-implementation of the same pipeline, which uses the model generated above to process the requests. But this results in duplication of code and hence maintenance.

Is there any way I can call the same pipeline (transform) in a very lightweight manner, just for a single row, so that it works concurrently and Spark does not remain a bottleneck?

Thanks,
Jatin
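[Editor's note] The serving pattern under discussion can be sketched without Spark: many single-row requests share one fitted transform and run on a bounded pool. Everything here is an illustrative stand-in (the class, the trivial word-count "transform", the pool size), not Spark API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SingleRowServingSketch {

    // Stand-in for a fitted pipeline's transform: trivial text processing
    static final Function<String, Integer> TRANSFORM =
        row -> row.trim().split("\\s+").length;

    // Bounded pool, mirroring a cap on concurrently running jobs
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Serve many single-row requests concurrently against one shared model
    static List<Integer> serve(List<String> rows) throws Exception {
        List<Future<Integer>> futures = new ArrayList<>();
        for (String row : rows) {
            futures.add(POOL.submit(() -> TRANSFORM.apply(row)));
        }
        List<Integer> out = new ArrayList<>();
        for (Future<Integer> f : futures) {
            out.add(f.get());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serve(List.of("hello world", "a b c")));  // [2, 3]
        POOL.shutdown();
    }
}
```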
Spark with Scala 2.12
Hello,

I am wondering if there is any new update on the Spark upgrade to Scala 2.12 (https://issues.apache.org/jira/browse/SPARK-14220), especially given that Scala 2.13 is close to release. I ask because there is no recent update on the JIRA and related tickets. Maybe someone is actively working on it and I am just not aware.

The fix looks like a difficult one, so it would be great if there could be some indication of the timeline, to help us plan better. And since it looks non-trivial, it is not something someone like me could easily help out with during free time. Hence, I can only request.

Thanks for all the great work.

Regards,
Jatin