Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options are maybe: the "spark.sql.files.ignoreCorruptFiles" option, or DataFrameReader.csv(csvDataset: Dataset[String]) with a custom InputFormat (this is available from Spark 2.2.0). For example, val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd",
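
A minimal sketch of the second option, assuming Spark 2.2+, a SparkSession named spark, and with the plain Hadoop TextInputFormat standing in for a custom, exception-tolerant one:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Read raw lines through a (possibly custom) new-API InputFormat...
    val rdd = spark.sparkContext.newAPIHadoopFile(
      "/tmp/abcd",                       // path from the thread
      classOf[TextInputFormat],          // replace with your exception-tolerant InputFormat
      classOf[LongWritable],
      classOf[Text])

    // ...then hand the lines to the CSV reader (DataFrameReader.csv(csvDataset) is new in 2.2.0).
    import spark.implicits._
    val lines = rdd.map(_._2.toString).toDS()
    val df = spark.read.option("header", "true").csv(lines)   // "header" is illustrative only

    // Alternatively, the first option:
    // spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")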

Re: Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
Thank you, that's what I want to confirm. 2017-03-16 lk_spark From: Yuhao Yang Sent: 2017-03-16 13:05 Subject: Re: Re: how to call recommend method from ml.recommendation.ALS To: "lk_spark" Cc: "任弘迪", "user.spark" This

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Rohit Karlupia
The number of tasks is very likely not the reason for the timeouts. A few things to look for: what is actually timing out, and what kind of operation is it? Writing/reading to HDFS (NameNode or DataNode), fetching shuffle data (with or without the External Shuffle Service), or the driver not being able to talk to an executor?

Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread Yuhao Yang
This is something that was just added to ML and will probably be released with 2.2. For now you can try to copy from the master code: https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352 and give it a
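
A hedged sketch of the API in the linked master code (released with 2.2); the input DataFrame and column names are assumptions:

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
    val model = als.fit(ratings)                    // `ratings` is an assumed DataFrame

    // The methods added in the linked code:
    val top10ItemsPerUser = model.recommendForAllUsers(10)
    val top10UsersPerItem = model.recommendForAllItems(10)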

Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
Thanks for your reply. What I exactly want to know is: in package mllib.recommendation, MatrixFactorizationModel has methods like recommendProducts, but I didn't find them in package ml.recommendation. How can I do the same thing as in mllib when I use ml? 2017-03-16 lk_spark From: 任弘迪

RE: Fast write datastore...

2017-03-15 Thread jasbir.sing
Hi, Will MongoDB not fit this solution? From: Vova Shelgunov [mailto:vvs...@gmail.com] Sent: Wednesday, March 15, 2017 11:51 PM To: Muthu Jayakumar Cc: vincent gromakowski ; Richard Siebeling ; user

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
>Reading your original question again, it seems to me probably you don't need a fast data store. Shiva, you are right. I only asked about fast writes and never mentioned reads :). For us, Cassandra may not be a choice for reads because of: a. its limitations on pagination support on the server side

Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread 任弘迪
If the number of user-item pairs to predict isn't too large, say millions, you could transform the target dataframe and save the result to a Hive table, then build a cache based on that table for online services. If that's not the case (such as billions of user-item pairs to predict), you have to start
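
A minimal sketch of the batch-scoring approach described above, assuming an already-fitted ALSModel named model, a DataFrame userItemPairs with the model's user/item columns, and a hypothetical table name:

    val scored = model.transform(userItemPairs)     // adds a "prediction" column
    scored.write.mode("overwrite").saveAsTable("recs.user_item_scores")
    // An online service can then build its cache from (or look up rows in) this table.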

Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal
Hi, The choice of ES vs Cassandra should really be made depending on your query use-cases. ES and Cassandra have their own strengths which should be matched to what you want to do rather than making a choice based on their respective feature sets. Reading your original question again, it seems

Re: Fast write datastore...

2017-03-15 Thread Koert Kuipers
We are using Elasticsearch for this. The issue of Elasticsearch falling over if the number of partitions/cores in Spark writing to it is too high does suck indeed. And the answer every time I asked about it on the Elasticsearch mailing list has been to reduce Spark tasks or increase Elasticsearch
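
One common workaround, sketched below, is to cap the write parallelism before handing the data to Elasticsearch. The format string and option come from the elasticsearch-hadoop connector; the host, index, and DataFrame name are illustrative:

    results
      .coalesce(8)                                  // fewer concurrent writers, less pressure on ES
      .write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")
      .mode("append")
      .save("myindex/mytype")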

Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Yong Zhang
Is the answer here good for your case? http://stackoverflow.com/questions/33151866/spark-udf-with-varargs scala - Spark UDF with varargs -
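
The pattern from the linked answer, roughly: pack the N columns into a single array column so one UDF can receive them as a Seq. Column names and the UDF body are illustrative:

    import org.apache.spark.sql.functions.{array, col, udf}

    val countNonNull = udf { values: Seq[String] => values.count(_ != null) }

    val colsToApply = Seq("c1", "c2", "c3")          // user-supplied column names
    val result = df.withColumn("nonNulls", countNonNull(array(colsToApply.map(col): _*)))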

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Yong Zhang
Not really sure what root problem you are trying to address. The number of tasks that need to run in Spark depends on the number of partitions in your job. Let's use a simple word count example: if your Spark job reads 128G of data from HDFS (assume the default block size is 128M), then the mapper
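
Working through the example: 128 GB of input at a 128 MB block size is roughly 1024 splits, so about 1024 map tasks. A quick way to confirm the partition count (the path is hypothetical):

    val lines = spark.sparkContext.textFile("hdfs:///data/words")
    println(lines.getNumPartitions)    // ~1024 for the 128 GB example above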

how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
Hi all: under Spark 2.0, I want to know, after training a ml.recommendation.ALSModel, how can I do the recommend action? I tried to save the model and load it with MatrixFactorizationModel but got an error. 2017-03-16 lk_spark
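
For what it's worth, an ml ALSModel is saved and loaded with its own companion loader rather than mllib's MatrixFactorizationModel. A rough sketch, with a hypothetical path and an assumed DataFrame of user/item pairs:

    import org.apache.spark.ml.recommendation.ALSModel

    model.save("/models/als")
    val reloaded = ALSModel.load("/models/als")
    val scored = reloaded.transform(userItemPairs)   // per-pair predictions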

Re: Fast write datastore...

2017-03-15 Thread Vova Shelgunov
Hi Muthu, I did not catch from your message what performance you expect from subsequent queries. Regards, Uladzimir On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" wrote: > Hello Uladzimir / Shiva, > > From ElasticSearch documentation (I have to see the logical plan of a >

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
Hello Uladzimir / Shiva, >From ElasticSearch documentation (i have to see the logical plan of a query to confirm), the richness of filters (like regex,..) is pretty good while comparing to Cassandra. As for aggregates, i think Spark Dataframes is quite rich enough to tackle. Let me know your

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini, We set that parameter before we went and played with the number of executors and that didn't seem to help at all. Thanks, KP On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote: > Hi, > > try using this parameter --conf spark.sql.shuffle.partitions=1000

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Jörn Franke
Hi, the Spark CSV parser has different parsing modes: * permissive (default) tries to read everything; missing tokens are interpreted as null and extra tokens are ignored * dropmalformed drops lines that have more or fewer tokens than expected * failfast throws a RuntimeException if there is a malformed line
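
A small sketch of choosing one of these modes on the built-in reader (the "mode" option and values follow the Spark CSV documentation; the header option and path are illustrative):

    val df = spark.read
      .option("mode", "DROPMALFORMED")   // or "PERMISSIVE" (default) or "FAILFAST"
      .option("header", "true")
      .csv("/path/to/files")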

[Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Nathan Case
Accidentally sent this to the dev mailing list, meant to send it here. I have a Spark Java application that in the past has used the hadoopFile interface to specify a custom TextInputFormat to be used when reading files. This custom class would gracefully handle exceptions like EOF exceptions
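
The rough shape of the hadoopFile pattern described above (sketched in Scala rather than Java for brevity); MyLenientTextInputFormat is a hypothetical subclass that swallows EOF-style exceptions instead of failing the task, and the path is illustrative:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val rdd = spark.sparkContext.hadoopFile(
      "/data/input",                       // hypothetical path
      classOf[TextInputFormat],            // replace with classOf[MyLenientTextInputFormat]
      classOf[LongWritable],
      classOf[Text])
    val lines = rdd.map(_._2.toString)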

Re: Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Vikash Pareek
++ Sudhir On Wed, 15 Mar 2017 at 4:06 PM, Saurav Sinha wrote: > Hi Users, > > > I am running spark job in yarn. > > I want to set staging directory to some other location which is by default > hdfs://host:port/home/$User/ > > In spark 2.0.0, it can be done by setting

Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal
Cassandra is probably a good choice if you are mainly looking for a datastore that supports fast writes. You can ingest the data into a table and define one or more materialized views on top of it to support your queries. Since you mention that your queries are going to be simple, you can define
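
A sketch of the ingest side with the DataStax spark-cassandra-connector (format string and options per the connector's docs; keyspace, table, and DataFrame names are hypothetical), with the materialized views for the read side then defined in CQL on top of the table:

    results.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "query_results"))
      .mode("append")
      .save()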

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
Hello Vincent, Cassandra may not fit the bill if I need to define my partitions and other indexes upfront. Is this right? Hello Richard, let me evaluate Apache Ignite. I did evaluate it 3 months back, and back then the connector to Apache Spark did not support Spark 2.0. Another drastic thought

Thrift Server as JDBC endpoint

2017-03-15 Thread Sebastian Piu
Hi all, I'm doing some research on the best ways to expose data created by some of our Spark jobs so that it can be consumed by a client (a web UI). The data we need to serve might be huge, but we can control the type of queries that are submitted, e.g.: * limit the number of results * only accept

Re: Fast write datastore...

2017-03-15 Thread Richard Siebeling
Maybe Apache Ignite does fit your requirements. On 15 March 2017 at 08:44, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > Hi > If queries are static and filters are on the same columns, Cassandra is a > good option. > > On 15 March 2017 at 7:04 AM, "muthu" wrote

Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Saurav Sinha
Hi Users, I am running a Spark job on YARN. I want to set the staging directory, which is by default hdfs://host:port/home/$User/, to some other location. In Spark 2.0.0, it can be done by setting spark.yarn.stagingDir. But in production we have Spark 1.6. Can anyone please suggest how it can be done

Re: Fast write datastore...

2017-03-15 Thread vincent gromakowski
Hi, if queries are static and filters are on the same columns, Cassandra is a good option. On 15 March 2017 at 7:04 AM, "muthu" wrote: Hello there, I have one or more parquet files to read and perform some aggregate queries on using Spark DataFrames. I would like to find a

Re: Scaling Kafka Direct Streming application

2017-03-15 Thread vincent gromakowski
You would probably need dynamic allocation, which is only available on YARN and Mesos, or wait for the ongoing Spark-on-Kubernetes integration. On 15 March 2017 at 1:54 AM, "Pranav Shukla" wrote: > How to scale or possibly auto-scale a spark streaming application > consuming from
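
A sketch of the standard dynamic-allocation settings the reply refers to (key names per the Spark configuration docs; the min/max values are illustrative, and the external shuffle service must also be enabled on the cluster):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")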

Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Hongdi Ren
Since N is decided at runtime, the first idea that comes to my mind is to transform the columns into one vector column (VectorAssembler can do that) and then let the UDF handle the vector, just like many ML transformers do. From: anup ahire Date: Wednesday, March 15, 2017 at 2:04 PM
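
A minimal sketch of that idea, using VectorAssembler to combine the N numeric columns into one vector column; the column names and UDF body are illustrative:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.udf

    val cols = Array("c1", "c2", "c3")               // user-supplied numeric columns
    val assembler = new VectorAssembler().setInputCols(cols).setOutputCol("features")
    val assembled = assembler.transform(df)

    val rowSum = udf { v: Vector => v.toArray.sum }  // example logic only
    val result = assembled.withColumn("rowSum", rowSum(assembled("features")))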

Spark SQL Skip and Log bad records

2017-03-15 Thread Aviral Agarwal
Hi guys, Is there a way to skip some bad records and log them when using DataFrame API ? Thanks and Regards, Aviral Agarwal

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
Hi Julian, Thanks for reporting this. This is a valid issue and I created https://issues.apache.org/jira/browse/SPARK-19957 to track it. Right now the seed is set to this.getClass.getName.hashCode.toLong by default, which indeed stays the same across multiple fits. Feel free to leave your
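
Until the default changes, passing an explicit (and varying) seed per fit sidesteps the issue; a small sketch, where k, the seed expression, and the input DataFrame are assumptions:

    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans()
      .setK(10)
      .setInitMode("random")
      .setSeed(scala.util.Random.nextLong())
    val model = kmeans.fit(features)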

apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread anup ahire
Hello, I have a schema and the names of columns to apply a UDF to. The column names are user input, and the number of them can differ for each input. Is there a way to apply UDFs to N columns in a dataframe? Thanks!

Fast write datastore...

2017-03-15 Thread muthu
Hello there, I have one or more parquet files to read and perform some aggregate queries on using Spark DataFrames. I would like to find a reasonably fast datastore that allows me to write the results for subsequent (simpler) queries. I did attempt to use ElasticSearch to write the query results