Re: spark distribution build fails

2022-03-17 Thread Martin Grigorov
Hi,

For the mail archives: this error happens when the user has the MAVEN_OPTS env
var pre-exported. In that case ./build/mvn (or ./build/sbt) does not export its
own MAVEN_OPTS with an -Xss<size> value, so the JVM's default thread stack size
is used; it is too small for the Scala compiler and leads to the
StackOverflowError.
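
A workaround that follows from this (a sketch; the -Xss value below is
illustrative, not from this thread): either unset the variable so that
build/mvn can export its own defaults, or raise the stack size yourself
before building:

unset MAVEN_OPTS
# or, keeping your own options while raising the thread stack size:
export MAVEN_OPTS="$MAVEN_OPTS -Xss128m"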

On Mon, Mar 14, 2022 at 11:13 PM Bulldog20630405 wrote:

>
> thanx; that worked great!
>
> On Mon, Mar 14, 2022 at 11:17 AM Sean Owen  wrote:
>
>> Try increasing the stack size in the build. It's the Xss argument you
>> find in various parts of the pom or sbt build. I have seen this and am
>> not sure why it happens on certain envs, but that's the workaround.
>>
>> On Mon, Mar 14, 2022, 8:59 AM Bulldog20630405 wrote:
>>
>>>
>>> using tag v3.2.1 with Java 8, getting a StackOverflowError when building
>>> the distribution:
>>>
>>> > alias mvn
>>> alias mvn='mvn --errors --fail-at-end -DskipTests '
>>> > dev/make-distribution.sh --name 'hadoop-3.2' --pip --tgz -Phive
>>> -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] Reactor Summary for Spark Project Parent POM 3.2.1:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ........................... SUCCESS [  2.978 s]
>>> [INFO] Spark Project Tags ................................. SUCCESS [  6.585 s]
>>> [INFO] Spark Project Sketch ............................... SUCCESS [  6.684 s]
>>> [INFO] Spark Project Local DB ............................. SUCCESS [  2.497 s]
>>> [INFO] Spark Project Networking ........................... SUCCESS [  6.312 s]
>>> [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.925 s]
>>> [INFO] Spark Project Unsafe ............................... SUCCESS [  7.879 s]
>>> [INFO] Spark Project Launcher ............................. SUCCESS [  2.238 s]
>>> [INFO] Spark Project Core ................................. SUCCESS [02:33 min]
>>> [INFO] Spark Project ML Local Library ..................... SUCCESS [ 24.566 s]
>>> [INFO] Spark Project GraphX ............................... SUCCESS [ 28.293 s]
>>> [INFO] Spark Project Streaming ............................ SUCCESS [ 51.070 s]
>>> [INFO] Spark Project Catalyst ............................. FAILURE [ 36.920 s]
>>> [INFO] Spark Project SQL .................................. SKIPPED
>>> [INFO] Spark Project ML Library ........................... SKIPPED
>>> [INFO] Spark Project Tools ................................ SKIPPED
>>> ...
>>> [INFO] Spark Avro ......................................... SKIPPED
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] BUILD FAILURE
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] Total time:  05:33 min
>>> [INFO] Finished at: 2022-03-14T13:45:15Z
>>> [INFO] ------------------------------------------------------------------------
>>> ---
>>> constituent[0]: file:/home/bulldog/software/maven/maven-3.8.4/conf/logging/
>>> constituent[1]: file:/home/bulldog/software/maven/maven-3.8.4/lib/maven-embedder-3.8.4.jar
>>> constituent[2]: file:/home/bulldog/software/maven/maven-3.8.4/lib/maven-settings-3.8.4.jar
>>> constituent[3]: file:/home/bulldog/software/maven/maven-3.8.4/lib/maven-settings-builder-3.8.4.jar
>>> constituent[4]: file:/home/bulldog/software/maven/maven-3.8.4/lib/maven-plugin-api-3.8.4.jar
>>> ...
>>> ---
>>> Exception in thread "main" java.lang.StackOverflowError
>>> at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:49)
>>> at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> ...


Re: Continuous ML model training in stream mode

2022-03-17 Thread Sean Owen
(Thank you, not sure that was me though)
I don't know of plans to expose the streaming impls in ML, as they still
work fine in MLlib and they also don't come up much. Continuous training is
relatively rare, maybe under-appreciated, but rare in practice.
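
For the archives, a minimal sketch of the DStream-based StreamingKMeans API
from the MLlib docs linked below (the source directory, k, dimensionality,
and batch interval are all illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="StreamingKMeansSketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# DStream of dense feature vectors parsed from CSV lines (path illustrative)
training = ssc.textFileStream("data/train").map(
    lambda line: Vectors.dense([float(x) for x in line.split(",")]))

# k=2 clusters over 3-dimensional points; centers update with every batch
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)
model.trainOn(training)

ssc.start()
ssc.awaitTermination()

The same model can also call predictOn(...) on a second DStream while
training continues.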

On Thu, Mar 17, 2022 at 1:57 PM Gourav Sengupta wrote:

> Dear friends,
>
> a few years ago, at a London meetup, I saw Sean (Owen) demonstrate how we
> can try to predict the gender of individuals responding to tweets (after
> they had accepted privacy agreements), if I am not wrong.
>
> It was real time, it was spectacular, and it was the presentation that set
> me on the path into data science and its applications.
>
> Thanks Sean! :)
>
> Regards,
> Gourav Sengupta
>
>
>
>
> On Tue, Mar 15, 2022 at 9:39 PM Artemis User wrote:
>
>> Thanks Sean!  Well, it looks like we have to abandon our structured
>> streaming model and use DStream for this, or do you see a possibility of
>> using structured streaming with ml instead of mllib?
>>
>> On 3/15/22 4:51 PM, Sean Owen wrote:
>>
>> There is a streaming k-means example in Spark.
>> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
>>
>> On Tue, Mar 15, 2022, 3:46 PM Artemis User wrote:
>>
>>> Has anyone done any experiments training an ML model on streaming
>>> data, especially for unsupervised models? Any suggestions/references
>>> are highly appreciated...
>>>
>>>
>>>
>>



Re: [Pyspark] [Linear Regression] Can't Fit Data

2022-03-17 Thread Sean Owen
The error points you to the answer. Somewhere in your code you are parsing
dates, and that date format is no longer valid / supported in Spark 3. These
changes are documented in the guide the error points you to.
It is not related to the regression itself.
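
For the archives, a sketch of both fixes the error message offers. It assumes
an active SparkSession named spark; the DataFrame, column, and pattern names
are illustrative, not from the original code. Note that Spark 3 patterns
follow java.time, where the am/pm designator is a single 'a', so a legacy
'aa' suffix is one common culprit.

# Option 1, straight from the error message: restore the pre-3.0 parser
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Option 2: rewrite the pattern per the linked datetime-pattern guide,
# e.g. changing 'aa' to 'a' (column and pattern here are illustrative)
from pyspark.sql.functions import to_timestamp
df = df.withColumn("ts", to_timestamp("ts_string", "MMM dd, yyyy hh:mm:ss a"))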

On Thu, Mar 17, 2022 at 11:35 AM Bassett, Kenneth wrote:

> Hello,
>
>
>
> I am having an issue with Linear Regression when trying to fit training
> data to the model. The code below used to work, but it stopped recently.
> Spark is version 3.2.1.
>
>
>
> # Split Data into train and test data
> train, test = data.randomSplit([0.9, 0.1])
> y = 'Build_Rate'
>
> # Perform regression with train data
> assembler = VectorAssembler(inputCols=feature_cols, outputCol="Features")
> vtrain = assembler.transform(train).select('Features', y)
> lin_reg = LinearRegression(regParam=0.0, elasticNetParam=0.0, solver='normal',
>                            featuresCol='Features', labelCol=y)
> model = lin_reg.fit(vtrain)  # <-- FAILS HERE
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 388.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 388.0 (TID 422) (10.139.64.4 executor 0):
> org.apache.spark.SparkUpgradeException: You may get a different result due
> to the upgrading of Spark 3.0: Fail to recognize MMM dd,  hh:mm:ss
> aa pattern in the DateTimeFormatter. 1) You can set
> spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before
> Spark 3.0. 2) You can form a valid datetime pattern with the guide from
> https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
>
>
>
> The full traceback is attached.
>
>
>
> The error is confusing me because there are no datetime columns in
> “train”. “vtrain” is just “train” with the feature columns in dense vector
> form.
>
> Does anyone know how to fix this error?
>
>
>
> Thanks,
>
>
> Ken Bassett
> Data Scientist
>
> 1451 Marvin Griffin Rd.
> Augusta, GA 30906
> (m) (706) 469-0696
> kbass...@textron.com


[Pyspark] [Linear Regression] Can't Fit Data

2022-03-17 Thread Bassett, Kenneth
Hello,

I am having an issue with Linear Regression when trying to fit training data to 
the model. The code below used to work, but it stopped recently. Spark is 
version 3.2.1.

# Split Data into train and test data
train, test = data.randomSplit([0.9, 0.1])
y = 'Build_Rate'

# Perform regression with train data
assembler = VectorAssembler(inputCols=feature_cols, outputCol="Features")
vtrain = assembler.transform(train).select('Features', y)
lin_reg = LinearRegression(regParam=0.0, elasticNetParam=0.0, solver='normal',
                           featuresCol='Features', labelCol=y)
model = lin_reg.fit(vtrain)  # <-- FAILS HERE

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 388.0 
(TID 422) (10.139.64.4 executor 0): org.apache.spark.SparkUpgradeException: You 
may get a different result due to the upgrading of Spark 3.0: Fail to recognize 
MMM dd,  hh:mm:ss aa pattern in the DateTimeFormatter. 1) You can 
set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before 
Spark 3.0. 2) You can form a valid datetime pattern with the guide from 
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

The full traceback is attached.

The error is confusing me because there are no datetime columns in "train". 
"vtrain" is just "train" with the feature columns in dense vector form.
Does anyone know how to fix this error?

Thanks,
Ken Bassett
Data Scientist

1451 Marvin Griffin Rd.
Augusta, GA 30906
(m) (706) 469-0696
kbass...@textron.com

---
Py4JJavaError Traceback (most recent call last)
 in 
> 1 model = lin_reg.fit(vtrain)

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
 28 call_succeeded = False
 29 try:
---> 30 result = original_method(self, *args, **kwargs)
 31 call_succeeded = True
 32 return result

/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise TypeError("Params must be either a param map or a list/tuple of param maps, "

/databricks/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
333 
334 def _fit(self, dataset):
--> 335 java_model = self._fit_java(dataset)
336 model = self._create_model(java_model)
337 return self._copyValues(model)

/databricks/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
330 """
331 self._transfer_params_to_java()
--> 332 return self._java_obj.fit(dataset._jdf)
333 
334 def _fit(self, dataset):

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305 answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o1033.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 413.0 failed 4 times, most recent failure: Lost task 0.3 in stage 413.0 
(TID 461) (10.139.64.4 executor 0): org.apache.spark.SparkUpgradeException: You 
may get a different result due to the upgrading of Spark 3.0: Fail to recognize 
'MMM dd,  hh:mm:ss aa' pattern in the DateTimeFormatter. 1) You can set 
spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before 
Spark 3.0. 2) You can form a valid datetime pattern with the guide from 
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failToRecognizePatternAfterUpgradeError(QueryExecutionErrors.scala:1054)
at