Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Jeff Zhang
Awesome !

Hyukjin Kwon wrote on Thursday, 13 July 2017 at 8:48 AM:

> Cool!
>
> 2017-07-13 9:43 GMT+09:00 Denny Lee :
>
>> This is amazingly awesome! :)
>>
>> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
>> wrote:
>>
>>> That's great!
>>>
>>>
>>>
>>> On 12 July 2017 at 12:41, Felix Cheung 
>>> wrote:
>>>
 Awesome! Congrats!!

 --
 *From:* holden.ka...@gmail.com  on behalf of
 Holden Karau 
 *Sent:* Wednesday, July 12, 2017 12:26:00 PM
 *To:* user@spark.apache.org
 *Subject:* With 2.2.0 PySpark is now available for pip install from
 PyPI :)

 Hi wonderful Python + Spark folks,

 I'm excited to announce that with Spark 2.2.0 we finally have PySpark
 published on PyPI (see https://pypi.python.org/pypi/pyspark /
 https://twitter.com/holdenkarau/status/885207416173756417). This has
 been a long time coming (previous releases included pip installable
 artifacts that for a variety of reasons couldn't be published to PyPI). So
 if you (or your friends) want to be able to work with PySpark locally on
 your laptop you've got an easier path getting started (pip install 
 pyspark).

 If you are setting up a standalone cluster your cluster will still need
 the "full" Spark packaging, but the pip installed PySpark should be able to
 work with YARN or an existing standalone cluster installation (of the same
 version).

 Happy Sparking y'all!

 Holden :)


 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Hyukjin Kwon
Cool!

2017-07-13 9:43 GMT+09:00 Denny Lee :

> This is amazingly awesome! :)
>
> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
> wrote:
>
>> That's great!
>>
>>
>>
>> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>>
>>> Awesome! Congrats!!
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* With 2.2.0 PySpark is now available for pip install from
>>> PyPI :)
>>>
>>> Hi wonderful Python + Spark folks,
>>>
>>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>>> been a long time coming (previous releases included pip installable
>>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>>> if you (or your friends) want to be able to work with PySpark locally on
>>> your laptop you've got an easier path getting started (pip install pyspark).
>>>
>>> If you are setting up a standalone cluster your cluster will still need
>>> the "full" Spark packaging, but the pip installed PySpark should be able to
>>> work with YARN or an existing standalone cluster installation (of the same
>>> version).
>>>
>>> Happy Sparking y'all!
>>>
>>> Holden :)
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :)

On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
wrote:

> That's great!
>
>
>
> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>
>> Awesome! Congrats!!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>> *To:* user@spark.apache.org
>> *Subject:* With 2.2.0 PySpark is now available for pip install from PyPI
>> :)
>>
>> Hi wonderful Python + Spark folks,
>>
>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>> been a long time coming (previous releases included pip installable
>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>> if you (or your friends) want to be able to work with PySpark locally on
>> your laptop you've got an easier path getting started (pip install pyspark).
>>
>> If you are setting up a standalone cluster your cluster will still need
>> the "full" Spark packaging, but the pip installed PySpark should be able to
>> work with YARN or an existing standalone cluster installation (of the same
>> version).
>>
>> Happy Sparking y'all!
>>
>> Holden :)
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist

2017-07-12 Thread Yong Zhang
Can't you just catch that exception and return an empty dataframe?


Yong
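
A minimal sketch of that approach in Scala (hedged: the helper name and `expectedSchema` are illustrative, not from this thread; it assumes an active SparkSession is passed in):

import org.apache.spark.sql.{AnalysisException, DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Try the read; if none of the paths exist, fall back to an empty DataFrame
// with a known schema. Note that this also swallows other analysis errors,
// so inspect the exception message if that matters for your job.
def readParquetOrEmpty(spark: SparkSession,
                       paths: Seq[String],
                       expectedSchema: StructType): DataFrame = {
  try {
    spark.read.parquet(paths: _*)
  } catch {
    case _: AnalysisException =>
      spark.createDataFrame(spark.sparkContext.emptyRDD[Row], expectedSchema)
  }
}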



From: Sumona Routh 
Sent: Wednesday, July 12, 2017 4:36 PM
To: user
Subject: DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: 
Path does not exist

Hi there,
I'm trying to read a list of paths from S3 into a dataframe for a window of 
time using the following:

sparkSession.read.parquet(listOfPaths:_*)

In some cases, the path may not be there because there is no data, which is an 
acceptable scenario.
However, Spark throws an AnalysisException: Path does not exist. Is there an 
option I can set to tell it to gracefully return an empty dataframe if a 
particular path is missing? Looking at the spark code, there is an option 
checkFilesExist, but I don't believe that is set in the particular flow of code 
that I'm accessing.

Thanks!
Sumona



Implementing Dynamic Sampling in a Spark Streaming Application

2017-07-12 Thread N B
Hi all,

Spark has had a backpressure implementation since 1.5 that helps to
stabilize a Spark Streaming application by keeping the processing time per
batch under control and below the batch interval. This
implementation leaves excess records in the source (Kafka, Flume etc) and
they get picked up in subsequent batches.

However, there are use cases where it would be useful to pick up the whole
batch of records from the source and randomly sample it down to a
dynamically computed "desired" batch size. This would allow the application
to not lag behind in processing the latest traffic with the trade off being
that some traffic could be lost. I believe such a random sampling strategy
has been proposed in the original backpressure JIRA (SPARK-7398) design doc
but not natively implemented yet.

I have written a blog post about implementing such a technique in the
application using the PIDEstimator used in Spark's Backpressure
implementation and randomly sampling the batch using its outcome.

Implementing a Dynamic Sampling Strategy in a Spark Streaming Application
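
For readers who just want the shape of the idea, here is a rough sketch (not the blog's actual code; `desiredBatchSize` is a hypothetical stand-in for whatever the PID-based estimator produces):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Randomly down-sample each micro-batch to a dynamically computed target size.
// count() triggers a job per batch, which is the price of knowing the batch size.
def sampleToTarget[T: ClassTag](stream: DStream[T],
                                desiredBatchSize: () => Long): DStream[T] =
  stream.transform { rdd: RDD[T] =>
    val total  = rdd.count()
    val target = desiredBatchSize()  // hypothetical: value from a PID estimator
    if (target > 0 && total > target) {
      rdd.sample(withReplacement = false, fraction = target.toDouble / total)
    } else {
      rdd
    }
  }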


Hope that some people find it useful. Comments and discussion are welcome.

Thanks,
Nikunj


DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist

2017-07-12 Thread Sumona Routh
Hi there,
I'm trying to read a list of paths from S3 into a dataframe for a window of
time using the following:

sparkSession.read.parquet(listOfPaths:_*)

In some cases, the path may not be there because there is no data, which is
an acceptable scenario.
However, Spark throws an AnalysisException: Path does not exist. Is there
an option I can set to tell it to gracefully return an empty dataframe if a
particular path is missing? Looking at the spark code, there is an option
checkFilesExist, but I don't believe that is set in the particular flow of
code that I'm accessing.

Thanks!
Sumona
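
An alternative to catching the exception (which the reply earlier in this digest suggests) is to drop missing paths up front via the Hadoop FileSystem API. A sketch only, not something proposed in the thread itself; it assumes `spark` is an active SparkSession and `listOfPaths` is a Seq of s3a:// URIs:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only the paths that actually exist, then read whatever is left.
val existingPaths = listOfPaths.filter { p =>
  val fs = FileSystem.get(new URI(p), spark.sparkContext.hadoopConfiguration)
  fs.exists(new Path(p))
}
val df =
  if (existingPaths.nonEmpty) spark.read.parquet(existingPaths: _*)
  else spark.emptyDataFrame  // schema-less; substitute a known schema if needed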


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread lucas.g...@gmail.com
That's great!



On 12 July 2017 at 12:41, Felix Cheung  wrote:

> Awesome! Congrats!!
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
> *To:* user@spark.apache.org
> *Subject:* With 2.2.0 PySpark is now available for pip install from PyPI
> :)
>
> Hi wonderful Python + Spark folks,
>
> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
> published on PyPI (see https://pypi.python.org/pypi/pyspark /
> https://twitter.com/holdenkarau/status/885207416173756417). This has been
> a long time coming (previous releases included pip installable artifacts
> that for a variety of reasons couldn't be published to PyPI). So if you (or
> your friends) want to be able to work with PySpark locally on your laptop
> you've got an easier path getting started (pip install pyspark).
>
> If you are setting up a standalone cluster your cluster will still need
> the "full" Spark packaging, but the pip installed PySpark should be able to
> work with YARN or an existing standalone cluster installation (of the same
> version).
>
> Happy Sparking y'all!
>
> Holden :)
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Felix Cheung
Awesome! Congrats!!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Wednesday, July 12, 2017 12:26:00 PM
To: user@spark.apache.org
Subject: With 2.2.0 PySpark is now available for pip install from PyPI :)

Hi wonderful Python + Spark folks,

I'm excited to announce that with Spark 2.2.0 we finally have PySpark published 
on PyPI (see https://pypi.python.org/pypi/pyspark / 
https://twitter.com/holdenkarau/status/885207416173756417). This has been a 
long time coming (previous releases included pip installable artifacts that for 
a variety of reasons couldn't be published to PyPI). So if you (or your 
friends) want to be able to work with PySpark locally on your laptop you've got 
an easier path getting started (pip install pyspark).

If you are setting up a standalone cluster your cluster will still need the 
"full" Spark packaging, but the pip installed PySpark should be able to work 
with YARN or an existing standalone cluster installation (of the same version).

Happy Sparking y'all!

Holden :)


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Holden Karau
Hi wonderful Python + Spark folks,

I'm excited to announce that with Spark 2.2.0 we finally have PySpark
published on PyPI (see https://pypi.python.org/pypi/pyspark /
https://twitter.com/holdenkarau/status/885207416173756417). This has been a
long time coming (previous releases included pip installable artifacts that
for a variety of reasons couldn't be published to PyPI). So if you (or your
friends) want to be able to work with PySpark locally on your laptop you've
got an easier path getting started (pip install pyspark).

If you are setting up a standalone cluster your cluster will still need the
"full" Spark packaging, but the pip installed PySpark should be able to
work with YARN or an existing standalone cluster installation (of the same
version).

Happy Sparking y'all!

Holden :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-12 Thread Steve Loughran

On 10 Jul 2017, at 21:57, Everett Anderson <ever...@nuna.com> wrote:

Hey,

Thanks for the responses, guys!

On Thu, Jul 6, 2017 at 7:08 AM, Steve Loughran <ste...@hortonworks.com> wrote:

On 5 Jul 2017, at 14:40, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

Are you sure that you use S3A?
Because EMR says that they do not support S3A

https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
> Amazon EMR does not currently support use of the Apache Hadoop S3A file 
> system.

Oof. I figured they didn't offer technical support for S3A, but didn't know 
that there was something saying EMR does not support use of S3A. My impression 
was that many people were using it and it's the recommended S3 library in 
Hadoop 2.7+ from Hadoop's point of 
view.

We're using it rather than S3N because we use encrypted buckets, and I don't 
think S3N supports picking up credentials from a machine role. Also, it was a 
bit distressing that it's unmaintained and has open bugs.

We're S3A rather than EMRFS because we have a setup where we submit work to a 
cluster via spark-submit run outside the cluster master node with --master 
yarn. When you do this, the Hadoop configuration accessible to spark-submit 
overrides that of the EMR cluster itself. If you use a configuration that uses 
EMRFS and any of the resources (like the JAR) you give to spark-submit are on 
S3, spark-submit will instantiate the EMRFS FileSystem impl, which is currently 
only available on the cluster, and fail. That said, we could work around this 
by resetting the configuration in code.


or, if you are using the URL s3:// to refer to Amazon EMR's S3, just edit your app
config so that fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem and use s3://
everywhere (use the fs.s3a. prefix for configuring S3A options, though)
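
In Spark terms that might look like the following sketch (assumptions: Hadoop properties are passed through with the spark.hadoop. prefix, and the fs.s3a.connection.maximum line is only an example of an S3A-specific option):

import org.apache.spark.SparkConf

// Route the s3:// scheme to the S3A implementation, then tune S3A itself
// via fs.s3a.* keys.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")  // example S3A option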



I think that the HEAD requests come from the `createBucketIfNotExists` in the 
AWS S3 library that checks if the bucket exists every time you do a PUT 
request, i.e. creates a HEAD request.

You can disable that by setting `fs.s3.buckets.create.enabled` to `false`
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-upload-s3.html
Oh, interesting. We are definitely seeing a ton of HEAD requests, which might 
be that. It looks like the `fs.s3.buckets.create.enabled` is an EMRFS option, 
though, not one common to the Hadoop S3 FileSystem implementations. Does that 
sound right?




Yeah, I'd like to see the stack traces before blaming S3A and the ASF codebase

(Sorry, to be clear -- I'm not trying to blame S3A. I figured someone else 
might've hit this and bet we had just misconfigured something or were doing 
this the wrong way.)

no worries,..if you are seeing problems, it's important to know where they are 
surfacing.



One thing I do know is that the shipping S3A client doesn't have any explicit 
handling of 503/retry events. I know that: 
https://issues.apache.org/jira/browse/HADOOP-14531

There is some retry logic in bits of the AWS SDK related to file upload: that 
may log and retry, but in all the operations listing files, getting their 
details, etc: no resilience to throttling.

If it is surfacing against s3a, there isn't anything which can immediately be 
done to fix it, other than "spread your data around more buckets". Do attach 
the stack trace you get under 
https://issues.apache.org/jira/browse/HADOOP-14381 though: I'm about half-way 
through the resilience code (& fault injection needed to test it). The more 
places I can see problems arise, the more confident I can be that those 
codepaths will be resilient.

Will do!

We did end up finding that some of our jobs were sharding data way too finely, 
ending up with 5-10k+ tiny Parquet shards per table. This happened when we 
unioned many Spark DataFrames together without doing a repartition or coalesce 
afterwards. After throwing in a repartition (to additionally balance the output 
shards) we haven't seen the error, again, but our graphs of S3 HEAD requests 
are still rather alarmingly high.



treewalking can be expensive that way; the more dirs you have, the more things 
look around.

If you are using S3A, and Hadoop 2.8+, log the toString() value of the FS after 
your submission. It'll give you a list of all the stats it collects, including 
details of high-level API calls alongside low-level HTTP requests: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Statistic.java
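
A minimal sketch of that (assuming Hadoop 2.8+ on the classpath and a live SparkSession named `spark`; the bucket URI is a placeholder):

import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Grab the S3A FileSystem instance the job used and print its statistics,
// which are included in toString() on Hadoop 2.8+.
val fs = FileSystem.get(new URI("s3a://some-bucket/"),
                        spark.sparkContext.hadoopConfiguration)
println(fs.toString)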






[ML] Performance issues with GBTRegressor

2017-07-12 Thread OBones

Hello all,

I'm using Spark for regression analysis on medium to large datasets, and its 
performance is very good when using random forests or decision trees.
Continuing my experimentation, I started using GBTRegressor and am 
finding it extremely slow compared to R, while both other methods 
were very fast.

Two examples to illustrate:
 - on a 300k-line dataset, R takes 3 minutes and GBTRegressor 15 minutes to 
process 2000 iterations, with maxDepth = 1 and minInstancesPerNode = 50
 - on a 3M-line dataset, R takes 3 minutes and GBTRegressor 47 minutes to 
process 10 iterations, with maxDepth = 2 and minInstancesPerNode = 50


I placed the code for the first example at the end of this message.

For the 300k dataset, I understand that there is a setup cost associated 
with Spark, which means that small datasets may not be processed as 
efficiently as in R, even though my testing with DecisionTree and 
RandomForest shows otherwise.
When I look at CPU usage for the GBT, it spikes at 90% (7 out of 8 cores) 
for relatively short bursts and then drops back to 8-10% (less than one 
core) for quite a while.
Compared to R, which uses a single core for its full 3-minute run, this is 
quite surprising.

What have I missed in my setup?

I've been told that the behavior I'm observing may be related to data 
skewness, but I'm not sure what's at hand here.


To my untrained eye, it looks as if there were an issue in the 
GBTRegressor class, but I can't figure it out.


Any help would be most welcome.

Regards

=
R code

train <- read.table("c:/Path/to/file.csv", header=T, sep=";",dec=".")
train$X1 <- factor(train$X1)
train$X2 <- factor(train$X2)
train$X3 <- factor(train$X3)
train$X4 <- factor(train$X4)
train$X5 <- factor(train$X5)
train$X6 <- factor(train$X6)
train$X7 <- factor(train$X7)
train$X8 <- factor(train$X8)
train$X9 <- factor(train$X9)

library(gbm)
boost <- gbm(Freq~X1+X2+X3+X4+X5+X6+X7+X8+X9+Y1, distribution = "gaussian",
             data = train, n.trees = 2000, bag.fraction = 1, shrinkage = 1,
             interaction.depth = 1, n.minobsinnode = 50, train.fraction = 1.0,
             cv.folds = 0, keep.data = TRUE)


=
scala code for Spark

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GBTRegressor

val conf = new SparkConf()
  .setAppName("GBTExample")
  .set("spark.driver.memory", "8g")
  .set("spark.executor.memory", "8g")
  .set("spark.network.timeout", "120s")
val sc = SparkContext.getOrCreate(conf.setMaster("local[8]"))
val spark = new SparkSession.Builder().getOrCreate()
import spark.implicits._

val sourceData = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("c:/Path/to/file.csv")

val data = sourceData.select($"X1", $"X2", $"X3", $"X4", $"X5", $"X6", 
$"X7", $"X8", $"X9", $"Y1".cast("double"), $"Freq".cast("double"))


val X1Indexer = new StringIndexer().setInputCol("X1").setOutputCol("X1Idx")
val X2Indexer = new StringIndexer().setInputCol("X2").setOutputCol("X2Idx")
val X3Indexer = new StringIndexer().setInputCol("X3").setOutputCol("X3Idx")
val X4Indexer = new StringIndexer().setInputCol("X4").setOutputCol("X4Idx")
val X5Indexer = new StringIndexer().setInputCol("X5").setOutputCol("X5Idx")
val X6Indexer = new StringIndexer().setInputCol("X6").setOutputCol("X6Idx")
val X7Indexer = new StringIndexer().setInputCol("X7").setOutputCol("X7Idx")
val X8Indexer = new StringIndexer().setInputCol("X8").setOutputCol("X8Idx")
val X9Indexer = new StringIndexer().setInputCol("X9").setOutputCol("X9Idx")

val assembler = new VectorAssembler()
  .setInputCols(Array("X1Idx", "X2Idx", "X3Idx", "X4Idx", "X5Idx",
    "X6Idx", "X7Idx", "X8Idx", "X9Idx", "Y1"))
  .setOutputCol("features")

val dt = new GBTRegressor()
  .setLabelCol("Freq")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxIter(2000)
  .setMinInstancesPerNode(50)
  .setMaxDepth(1)
  .setStepSize(1)
  .setSubsamplingRate(1)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(X1Indexer, X2Indexer, X3Indexer, X4Indexer,
    X5Indexer, X6Indexer, X7Indexer, X8Indexer, X9Indexer, assembler, dt))


val model = pipeline.fit(data)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Testing another Dataset after ML training

2017-07-12 Thread Michael C. Kunkel

Greetings Riccardo,

That is indeed my post. That is my second attempt at getting this 
problem to work. I am not sure if the vector sizes are different, as I 
know the "unknown" data is just a blind copy of 3 of the inputs used for 
the training data.

I will pursue this avenue more.

Thanks for the correspondence.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 12/07/2017 14:05, Riccardo Ferrari wrote:

Hi Michael,

I think I found you posting on SO:
https://stackoverflow.com/questions/45041677/java-spark-training-on-new-data-with-datasetrow-from-csv-file

The exception trace there is quite different from what I read here, 
and indeed is self-explanatory:

...
Caused by: java.lang.IllegalArgumentException: requirement failed: The 
columns of A don't match the number of elements of x. A: 38611, x: 36179

...
Can it be that you have different 'features' vector sizes from train 
and test?


Best,

On Wed, Jul 12, 2017 at 1:41 PM, Kunkel, Michael C. <m.kun...@fz-juelich.de> wrote:


Greetings

The attachment I meant to refer to was the posting in the initial
email on the email list.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp 

On Jul 12, 2017, at 09:56, Riccardo Ferrari <ferra...@gmail.com> wrote:


Hi Michael,

I don't see any attachment, not sure you can attach files though

On Tue, Jul 11, 2017 at 10:44 PM, Michael C. Kunkel <m.kun...@fz-juelich.de> wrote:

Greetings,

Thanks for the communication.

I attached the entire stacktrace in which was output to the
screen.
I tried to use JavaRDD and LabeledPoint then convert to
Dataset and I still get the same error as I did when I only
used datasets.

I am using the expected ml Vector. I tried it using the mllib
and that also didn't work.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp 

On 11/07/2017 17:21, Riccardo Ferrari wrote:

Mh, to me feels like there some data mismatch. Are you sure
you're using the expected Vector (ml vs mllib). I am not
sure you attached the whole Exception but you might find
some more useful details there.
Best,
On Tue, Jul 11, 2017 at 3:07 PM, mckunkel <m.kun...@fz-juelich.de> wrote:

Im not sure why I cannot subscribe, so that everyone can
view the conversation. Help? -- View this message in
context:

http://apache-spark-user-list.1001560.n3.nabble.com/Testing-another-Dataset-after-ML-training-tp28845p28846.html



Sent from the Apache Spark User List mailing list
archive at Nabble.com .

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org







Forschungszentrum Juelich GmbH 52425 Juelich Sitz der
Gesellschaft: Juelich Eingetragen im Handelsregister des
Amtsgerichts Dueren Nr. HR B 3498 Vorsitzender des
Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt
(Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Prof.
Dr.-Ing. Harald Bolt, Prof. Dr. Sebastian M. Schmidt








Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
Hi Michael,

I think I found you posting on SO:
https://stackoverflow.com/questions/45041677/java-spark-training-on-new-data-with-datasetrow-from-csv-file

The exception trace there is quite different from what I read here, and
indeed is self-explanatory:
...
Caused by: java.lang.IllegalArgumentException: requirement failed: The
columns of A don't match the number of elements of x. A: 38611, x: 36179
...
Can it be that you have different 'features' vector sizes from train and
test?
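
One frequent cause of such a mismatch (a hedged guess, not something confirmed in this thread) is fitting the feature transformers separately on the training data and on the new data; reusing the single fitted PipelineModel keeps the vector lengths identical. A minimal sketch with illustrative column names:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Fit the feature pipeline once on the training data, then apply the *same*
// fitted model to new data so both produce feature vectors of equal length.
def buildFeatures(trainDF: DataFrame, testDF: DataFrame): (DataFrame, DataFrame) = {
  val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
  val assembler = new VectorAssembler()
    .setInputCols(Array("categoryIdx", "x1", "x2"))
    .setOutputCol("features")
  val fitted: PipelineModel =
    new Pipeline().setStages(Array(indexer, assembler)).fit(trainDF)
  (fitted.transform(trainDF), fitted.transform(testDF))  // never fit() on testDF
}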

Best,

On Wed, Jul 12, 2017 at 1:41 PM, Kunkel, Michael C. 
wrote:

> Greetings
>
> The attachment I meant to refer to was the posting in the initial email on
> the email list.
>
> BR
> MK
> 
> Michael C. Kunkel, USMC, PhD
> Forschungszentrum Jülich
> Nuclear Physics Institute and Juelich Center for Hadron Physics
> Experimental Hadron Structure (IKP-1)
> www.fz-juelich.de/ikp
>
> On Jul 12, 2017, at 09:56, Riccardo Ferrari  wrote:
>
> Hi Michael,
>
> I don't see any attachment, not sure you can attach files though
>
> On Tue, Jul 11, 2017 at 10:44 PM, Michael C. Kunkel <
> m.kun...@fz-juelich.de> wrote:
>
>> Greetings,
>>
>> Thanks for the communication.
>>
>> I attached the entire stacktrace in which was output to the screen.
>> I tried to use JavaRDD and LabeledPoint then convert to Dataset and I
>> still get the same error as I did when I only used datasets.
>>
>> I am using the expected ml Vector. I tried it using the mllib and that
>> also didn't work.
>>
>> BR
>> MK
>> 
>> Michael C. Kunkel, USMC, PhD
>> Forschungszentrum Jülich
>> Nuclear Physics Institute and Juelich Center for Hadron Physics
>> Experimental Hadron Structure (IKP-1)www.fz-juelich.de/ikp
>>
>> On 11/07/2017 17:21, Riccardo Ferrari wrote:
>>
>> Mh, to me feels like there some data mismatch. Are you sure you're using
>> the expected Vector (ml vs mllib). I am not sure you attached the whole
>> Exception but you might find some more useful details there.
>>
>> Best,
>>
>> On Tue, Jul 11, 2017 at 3:07 PM, mckunkel  wrote:
>>
>>> Im not sure why I cannot subscribe, so that everyone can view the
>>> conversation.
>>> Help?
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Testing-another-Dataset-after-ML-train
>>> ing-tp28845p28846.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>>
>> 
>> 
>> 
>> 
>> Forschungszentrum Juelich GmbH
>> 52425 Juelich
>> Sitz der Gesellschaft: Juelich
>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>> Prof. Dr. Sebastian M. Schmidt
>> 
>> 
>> 
>> 
>>
>>
>


Re: Testing another Dataset after ML training

2017-07-12 Thread Kunkel, Michael C.
Greetings

The attachment I meant to refer to was the posting in the initial email on the 
email list.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On Jul 12, 2017, at 09:56, Riccardo Ferrari <ferra...@gmail.com> wrote:

Hi Michael,

I don't see any attachment, not sure you can attach files though

On Tue, Jul 11, 2017 at 10:44 PM, Michael C. Kunkel <m.kun...@fz-juelich.de> wrote:

Greetings,

Thanks for the communication.

I attached the entire stacktrace in which was output to the screen.
I tried to use JavaRDD and LabeledPoint then convert to Dataset and I still get 
the same error as I did when I only used datasets.

I am using the expected ml Vector. I tried it using the mllib and that also 
didn't work.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 11/07/2017 17:21, Riccardo Ferrari wrote:
Mh, to me feels like there some data mismatch. Are you sure you're using the 
expected Vector (ml vs mllib). I am not sure you attached the whole Exception 
but you might find some more useful details there.

Best,

On Tue, Jul 11, 2017 at 3:07 PM, mckunkel <m.kun...@fz-juelich.de> wrote:
Im not sure why I cannot subscribe, so that everyone can view the
conversation.
Help?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-another-Dataset-after-ML-training-tp28845p28846.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org







Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt






CVE-2017-7678 Apache Spark XSS web UI MHTML vulnerability

2017-07-12 Thread Sean Owen
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
Versions of Apache Spark before 2.2.0

Description:
It is possible for an attacker to take advantage of a user's trust in the
server to trick them into visiting a link that points to a shared Spark
cluster and submits data including MHTML to the Spark master, or history
server. This data, which could contain a script, would then be reflected
back to the user and could be evaluated and executed by MS Windows-based
clients. It is not an attack on Spark itself, but on the user, who may then
execute the script inadvertently when viewing elements of the Spark web UIs.

Mitigation:
Update to Apache Spark 2.2.0 or later.

Example:
Request:
GET
/app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
_AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
HTTP/1.1

Excerpt from response:
No running application with ID Content-Type:
multipart/related;
boundary=_AppScan
--_AppScan
Content-Location:foo
Content-Transfer-Encoding:base64
PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+


Result: In the above payload the BASE64 data decodes as:
<html><script>alert("XSS")</script></html>

Credit:
Mike Kasper, Nicholas Marion
IBM z Systems Center for Secure Engineering


Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
Hi Michael,

I don't see any attachment, not sure you can attach files though

On Tue, Jul 11, 2017 at 10:44 PM, Michael C. Kunkel 
wrote:

> Greetings,
>
> Thanks for the communication.
>
> I attached the entire stacktrace in which was output to the screen.
> I tried to use JavaRDD and LabeledPoint then convert to Dataset and I
> still get the same error as I did when I only used datasets.
>
> I am using the expected ml Vector. I tried it using the mllib and that
> also didn't work.
>
> BR
> MK
> 
> Michael C. Kunkel, USMC, PhD
> Forschungszentrum Jülich
> Nuclear Physics Institute and Juelich Center for Hadron Physics
> Experimental Hadron Structure (IKP-1)www.fz-juelich.de/ikp
>
> On 11/07/2017 17:21, Riccardo Ferrari wrote:
>
> Mh, to me feels like there some data mismatch. Are you sure you're using
> the expected Vector (ml vs mllib). I am not sure you attached the whole
> Exception but you might find some more useful details there.
>
> Best,
>
> On Tue, Jul 11, 2017 at 3:07 PM, mckunkel  wrote:
>
>> Im not sure why I cannot subscribe, so that everyone can view the
>> conversation.
>> Help?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Testing-another-Dataset-after-ML-train
>> ing-tp28845p28846.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
>
> 
> 
> 
> 
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> 
> 
> 
> 
>
>