Re: spark2.4 arrow enabled true, error log not returned

2019-01-10 Thread Bryan Cutler
Hi, could you please clarify if you are running a YARN cluster when you see this problem? I tried on Spark standalone and could not reproduce. If it's on a YARN cluster, please file a JIRA and I can try to investigate further. Thanks, Bryan

On Sat, Dec 15, 2018 at 3:42 AM 李斌松 wrote: >

Remote Data Read Time

2019-01-10 Thread swastik mittal
I was working with a custom Spark listener library. There, I am not able to figure out a way to break into the details of a task. I only have a listener which runs on task start, but I want to calculate the time my executor took to read input data from a remote data source for that task, but as Spark
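
For what the listener API does expose, here is a minimal sketch (assuming Spark 2.x) of a SparkListener that logs per-task run time and input metrics at task end. Spark does not report a dedicated remote-read-time metric per task, so executorRunTime together with the input byte/record counts are the closest built-in proxies; the class name is mine:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs per-task metrics when each task finishes. There is no built-in
// "remote read time" metric, so this reports run time and input volume.
class InputMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTimeMs=${m.executorRunTime} " +
        s"inputBytes=${m.inputMetrics.bytesRead} " +
        s"inputRecords=${m.inputMetrics.recordsRead}")
    }
  }
}

// Register it on an existing SparkSession's context:
// spark.sparkContext.addSparkListener(new InputMetricsListener)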

Re: Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
I actually tried that first. I moved away from it because the algorithm needs to evaluate all records for all models; for instance, a model trained on (2,4) needs to be evaluated on a record whose true label is 8. I found that if I apply the filter in the label-creation transformer, then a record

Re: Spark ML with null labels

2019-01-10 Thread Xiangrui Meng
In your custom transformer that produces labels, can you filter null labels? A transformer doesn't always need to do a 1:1 mapping.

On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy wrote: > I'm trying to implement an algorithm on the MNIST digits that runs like so: > - for every pair of digits (0,1),
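
A minimal sketch of the suggestion above (the column name "label" and the transformer name are assumptions for illustration): a Transformer whose transform simply drops rows with a null label, which is legal precisely because transform may return fewer rows than it receives.

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Drops rows whose "label" column is null; the schema is unchanged.
class DropNullLabels(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("dropNullLabels"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.filter(col("label").isNotNull).toDF()

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): DropNullLabels = defaultCopy(extra)
}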

Re: Performance Issue

2019-01-10 Thread Gourav Sengupta
Hi Tzahi, by using GROUP BY without any aggregate columns, are you just trying to find the DISTINCT of the columns? Also, it may be of help (I do not know whether the SQL optimiser automatically takes care of this) to have the LEFT JOIN on a smaller data set by having joined on the device_id
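
For reference, the two formulations being contrasted, written as a sketch with column and table names borrowed from the thread: a GROUP BY with no aggregate columns and a SELECT DISTINCT return the same rows, and Spark's optimizer generally plans both as an aggregate on the grouping keys.

// Equivalent ways to deduplicate (device_id and event_day are assumed
// column names; raw_e is the table mentioned later in the thread).
val distinctDf = spark.sql("SELECT DISTINCT device_id, event_day FROM raw_e")
val groupedDf  = spark.sql("SELECT device_id, event_day FROM raw_e GROUP BY device_id, event_day")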

Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
I'm trying to implement an algorithm on the MNIST digits that runs like so:
- for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label to the digits and build a LogisticRegression Classifier -- 45 in total
- Fit every classifier on the test set separately
- Aggregate the
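
A sketch of that pairwise setup (the input DataFrame `train` and its columns "digit" and "features" are assumptions): relabel each digit pair to 0/1 and fit one LogisticRegression per pair, 45 models for digits 0-9.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// For each unordered digit pair (a, b), keep only rows for those two
// digits, map a -> 0.0 and b -> 1.0, and fit a binary classifier.
def fitPairwise(train: DataFrame) =
  for {
    a <- 0 to 8
    b <- (a + 1) to 9
  } yield {
    val pair = train
      .filter(col("digit").isin(a, b))
      .withColumn("label", when(col("digit") === a, 0.0).otherwise(1.0))
    ((a, b), new LogisticRegression().fit(pair))
  }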

Re: Reading as Parquet a directory created by Spark Structured Streaming - problems

2019-01-10 Thread ddebarbieux
scala> spark.read.schema(StructType(Seq(StructField("_1",StringType,false), StructField("_2",StringType,true)))).parquet("hdfs://---/MY_DIRECTORY/*_1=201812030900*").show()
+----+--------------------+
|  _1|                  _2|
+----+--------------------+
|null|ba1ca2dc033440125...|
|null|ba1ca2dc033440125...|
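
One thing worth ruling out (an assumption about the cause, not a confirmed diagnosis): _1 looks like a partition column, so its values live in the directory names rather than in the parquet files themselves. When reading below the partition root with an explicit schema, setting basePath lets Spark recover _1 from the path instead of returning null.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// basePath tells Spark where partition discovery should start, so the
// _1=... directory segment is parsed back into the _1 column.
val df = spark.read
  .option("basePath", "hdfs://---/MY_DIRECTORY")
  .schema(StructType(Seq(
    StructField("_1", StringType, false),
    StructField("_2", StringType, true))))
  .parquet("hdfs://---/MY_DIRECTORY/*_1=201812030900*")
df.show()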

Re: Performance Issue

2019-01-10 Thread Tzahi File
Hi Gourav, My version of Spark is 2.1. The data is stored in an S3 directory in parquet format. I sent you an example of a query I would like to run (the raw_e table is stored as parquet files and event_day is the partitioned field):

SELECT * FROM (select * from parquet_files.raw_e as re
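
Since event_day is the partition field, putting a filter on it directly is what lets Spark prune S3 partitions instead of listing and scanning all of them. A sketch along the lines of the query above (the date literal is illustrative):

// Filtering on the partition column prunes S3 directories before any
// parquet files are read; the event_day value here is an assumption.
val pruned = spark.sql(
  """SELECT re.*
    |FROM parquet_files.raw_e AS re
    |WHERE re.event_day = '2018-12-01'""".stripMargin)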