Re: Crowdsourced triage Scapegoat compiler plugin warnings

2017-07-13 Thread Jorge Sánchez
It might be worth sharing this with the user list; there must be people
willing to collaborate who are not on the dev list.

2017-07-13 10:00 GMT+01:00 Sean Owen :

> I don't think everything needs to be triaged. There are a ton of useful
> changes that have been identified. I think you could just pick some warning
> types where they've all been triaged and go fix them.
>
> On Thu, Jul 13, 2017 at 9:16 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>>
>> Another gentle ping for help.
>>
>> If no one takes this on, I will open a JIRA and proceed with it myself
>> after a couple of weeks, although I hope someone else picks it up.
>>
>>
>>


Some Spark MLLIB tests failing due to some classes not being registered with Kryo

2017-11-11 Thread Jorge Sánchez
Hi Dev,

I'm running the MLlib tests on the current master branch, and the following
suites are failing because some classes are not registered with Kryo:

org.apache.spark.mllib.MatricesSuite
org.apache.spark.mllib.VectorsSuite
org.apache.spark.ml.InstanceSuite

I can solve it by registering the failing classes with Kryo, but I'm
wondering if I'm missing something, as these tests shouldn't be failing on
master.

Any suggestions on what I may be doing wrong?

Thank you.
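[Editor's note: for anyone reproducing this, the registration itself is a small configuration change. A minimal sketch in spark-defaults.conf style; the class list is illustrative, based on the failing suites, and `registrationRequired` is what turns unregistered classes into hard failures:]

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     org.apache.spark.ml.feature.Instance,org.apache.spark.ml.linalg.DenseVector
```

The same can be done programmatically with `SparkConf.registerKryoClasses`. Either way this is only a workaround; the test suites themselves should register whatever they serialize.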


Re: Some Spark MLLIB tests failing due to some classes not being registered with Kryo

2017-11-11 Thread Jorge Sánchez
No luck: the suites still fail when running mvn test from the Spark base
directory, and also with mvn -pl mllib test.

Any other suggestion would be much appreciated.

Thank you.
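[Editor's note: Marco's suggestion below can be run from the Spark base directory as follows. A sketch, assuming a standard Spark checkout; `-am` ("also make") builds mllib's upstream modules first so the tests run against fresh artifacts rather than stale local ones:]

```shell
# From the Spark base directory: install mllib and the modules it
# depends on (-am = "also make"), skipping tests during the build
./build/mvn -DskipTests -pl mllib -am install

# Then run only the mllib test suites
./build/mvn -pl mllib test
```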

2017-11-11 12:46 GMT+00:00 Marco Gaido :

> Hi Jorge,
>
> then try running the tests not from the mllib folder, but on Spark base
> directory.
> If you want to run only the tests in mllib, you can specify the project
> using the -pl argument of mvn.
>
> Thanks,
> Marco
>
>
>
> 2017-11-11 13:37 GMT+01:00 Jorge Sánchez :
>
>> Hi Marco,
>>
>> Just mvn test from the mllib folder.
>>
>> Thank you.
>>
>> 2017-11-11 12:36 GMT+00:00 Marco Gaido :
>>
>>> Hi Jorge,
>>>
>>> how are you running those tests?
>>>
>>> Thanks,
>>> Marco
>>>
>>> 2017-11-11 13:21 GMT+01:00 Jorge Sánchez :
>>>
>>>> Hi Dev,
>>>>
>>>> I'm running the MLLIB tests in the current Master branch and the
>>>> following Suites are failing due to some classes not being registered with
>>>> Kryo:
>>>>
>>>> org.apache.spark.mllib.MatricesSuite
>>>> org.apache.spark.mllib.VectorsSuite
>>>> org.apache.spark.ml.InstanceSuite
>>>>
>>>> I can solve it by registering the failing classes with Kryo, but I'm
>>>> wondering if I'm missing something as these tests shouldn't be failing from
>>>> Master.
>>>>
>>>> Any suggestions on what I may be doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>
>>>
>>
>


Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-15 Thread Jorge Sánchez
Hi,

after seeing that IDF needed refactoring to use ML vectors instead of MLlib
ones, I have created a Jira ticket at
https://issues.apache.org/jira/browse/SPARK-22531 and submitted a PR for it.
If anyone can have a look and suggest any changes, it would be really
appreciated.

Thank you.
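[Editor's note: the transformer added by Bago's PR below, VectorSizeHint, fixes the streaming VectorAssembler failure by declaring the vector column's size metadata up front. A minimal Java sketch, assuming the API as merged; the column name and size are illustrative:]

```java
import org.apache.spark.ml.feature.VectorSizeHint;

// Attach size metadata to a vector column so that downstream stages
// such as VectorAssembler can validate it in Structured Streaming.
VectorSizeHint sizeHint = new VectorSizeHint()
    .setInputCol("wordfeatures")  // illustrative column name
    .setSize(3)                   // must match the real vector length
    .setHandleInvalid("error");   // fail fast on mismatched rows

// Place sizeHint in the pipeline stages before the VectorAssembler.
```

With the hint in place, VectorAssembler sees a known size for the column instead of missing metadata, which is what trips it up on the streaming path.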


2017-11-15 1:11 GMT+00:00 Bago Amirbekian :

> There is a known issue with VectorAssembler which causes it to fail in
> streaming if any of the input columns are of VectorType & don't have size
> information, https://issues.apache.org/jira/browse/SPARK-22346.
>
> This can be fixed by adding size information to the vector columns, I've
> made a PR to add a transformer to spark to help with this,
> https://github.com/apache/spark/pull/19746. It would be awesome if you
> could take a look and see if this would fix your issue.
>
> On Sun, Nov 12, 2017 at 5:37 PM Davis Varghese  wrote:
>
>> Bago,
>>
>> Finally I am able to create one which fails consistently. I think the
>> issue is caused by the VectorAssembler in the model. In the new code, I
>> have 2 features (1 text and 1 number) and I have to run them through a
>> VectorAssembler before feeding them to LogisticRegression. Code and test
>> data below:
>>
>> import java.util.Arrays;
>> import java.util.List;
>> import org.apache.spark.ml.Pipeline;
>> import org.apache.spark.ml.PipelineModel;
>> import org.apache.spark.ml.PipelineStage;
>> import org.apache.spark.ml.classification.LogisticRegression;
>> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
>> import org.apache.spark.ml.feature.CountVectorizer;
>> import org.apache.spark.ml.feature.CountVectorizerModel;
>> import org.apache.spark.ml.feature.IndexToString;
>> import org.apache.spark.ml.feature.StringIndexer;
>> import org.apache.spark.ml.feature.StringIndexerModel;
>> import org.apache.spark.ml.feature.Tokenizer;
>> import org.apache.spark.ml.feature.VectorAssembler;
>> import org.apache.spark.ml.param.ParamMap;
>> import org.apache.spark.ml.tuning.ParamGridBuilder;
>> import org.apache.spark.ml.tuning.TrainValidationSplit;
>> import org.apache.spark.ml.tuning.TrainValidationSplitModel;
>> import org.apache.spark.sql.Dataset;
>> import org.apache.spark.sql.Row;
>> import org.apache.spark.sql.RowFactory;
>> import org.apache.spark.sql.SparkSession;
>> import org.apache.spark.sql.streaming.StreamingQuery;
>> import org.apache.spark.sql.types.DataTypes;
>> import org.apache.spark.sql.types.Metadata;
>> import org.apache.spark.sql.types.StructField;
>> import org.apache.spark.sql.types.StructType;
>>
>> /**
>>  * A simple text classification pipeline that recognizes "spark" from
>>  * input text.
>>  */
>> public class StreamingIssueCountVectorizerSplitFailed {
>>
>>   public static void main(String[] args) throws Exception {
>> SparkSession sparkSession =
>> SparkSession.builder().appName("StreamingIssueCountVectorizer")
>> .master("local[2]")
>> .getOrCreate();
>>
>> List<Row> _trainData = Arrays.asList(
>> RowFactory.create("sunny fantastic day", 1, "Positive"),
>> RowFactory.create("fantastic morning match", 1, "Positive"),
>> RowFactory.create("good morning", 1, "Positive"),
>> RowFactory.create("boring evening", 5, "Negative"),
>> RowFactory.create("tragic evening event", 5, "Negative"),
>> RowFactory.create("today is bad ", 5, "Negative")
>> );
>> List<Row> _testData = Arrays.asList(
>> RowFactory.create("sunny morning", 1),
>> RowFactory.create("bad evening", 5)
>> );
>> StructType schema = new StructType(new StructField[]{
>> new StructField("tweet", DataTypes.StringType, false,
>> Metadata.empty()),
>> new StructField("time", DataTypes.IntegerType, false,
>> Metadata.empty()),
>> new StructField("sentiment", DataTypes.StringType, true,
>> Metadata.empty())
>> });
>> StructType testSchema = new StructType(new StructField[]{
>> new StructField("tweet", DataTypes.StringType, false,
>> Metadata.empty()),
>> new StructField("time", DataTypes.IntegerType, false,
>> Metadata.empty())
>> });
>>
>> Dataset<Row> trainData = sparkSession.createDataFrame(_trainData,
>> schema);
>> Dataset<Row> testData = sparkSession.createDataFrame(_testData,
>> testSchema);
>> StringIndexerModel labelIndexerModel = new StringIndexer()
>> .setInputCol("sentiment")
>> .setOutputCol("label")
>> .setHandleInvalid("skip")
>> .fit(trainData);
>> Tokenizer tokenizer = new Tokenizer()
>> .setInputCol("tweet")
>> .setOutputCol("words");
>> CountVectorizer countVectorizer = new CountVectorizer()
>> .setInputCol(tokenizer.getOutputCol())
>> .setOutputCol("wordfeatures")
>> .setVocabSize(3)
>> .setMinDF(2)
>> .setMinTF(2)
>> .setBinary(true);
>>
>> VectorAssembler vectorAssembler = new VectorAssembler()
>> .setInputCols(ne