Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options are maybe:

- the "spark.sql.files.ignoreCorruptFiles" option

- DataFrameReader.csv(csvDataset: Dataset[String]) with a custom input format
(this is available from Spark 2.2.0).

For example,

import spark.implicits._  // needed for .toDS below

val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd",
  classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
val stringRdd = rdd.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength))

spark.read.csv(stringRdd.toDS)
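For the first option, a minimal sketch (assuming Spark 2.1+, where the
spark.sql.files.ignoreCorruptFiles flag is available; the path just reuses the
example above):

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
// corrupt files are skipped instead of failing the whole job
val df = spark.read.csv("/tmp/abcd")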



2017-03-16 2:11 GMT+09:00 Jörn Franke :

> Hi,
>
> The Spark CSV parser has different parsing modes:
> * permissive (default) tries to read everything; missing tokens are
> interpreted as null and extra tokens are ignored
> * dropmalformed drops lines which have more or fewer tokens than expected
> * failfast - RuntimeException if there is a malformed line
> Obviously this does not handle malformed gzip (you may ask the sender of the
> gzip to improve their application).
>
> You can adapt the line you mentioned (not sure which Spark version this
> is), but you may not want to do that, because it would mean maintaining
> your own Spark version.
>
> You can write your own data source (i.e. in a different namespace than Spark
> CSV). Then you can also think about a lot of optimisations compared to the
> Spark CSV parser, which - depending on the CSV and your analysis needs -
> can make processing much faster.
>
> You could also add a new compression codec that ignores broken gzips. In
> that case you would not need your own data source.
>
> Best regards
>
> On 15 Mar 2017, at 16:56, Nathan Case  wrote:
>
> Accidentally sent this to the dev mailing list, meant to send it here.
>
> I have a spark java application that in the past has used the hadoopFile
> interface to specify a custom TextInputFormat to be used when reading
> files.  This custom class would gracefully handle exceptions like EOF
> exceptions caused by corrupt gzip files in the input data.  I recently
> switched to using the csv parser built into spark but am now faced with the
> problem that anytime a bad input file is encountered my whole job fails.
>
> My code to load the data using csv is:
>
> Dataset<Row> csv = sparkSession.read()
> .option("delimiter", parseSettings.getDelimiter().toString())
> .option("quote", parseSettings.getQuote())
> .option("parserLib", "UNIVOCITY")
> .csv(paths);
>
> Previously I would load the data using:
>
> JavaRDD<String> lines = sc.newAPIHadoopFile(filePaths, NewInputFormat.class,
> LongWritable.class, Text.class, sc.hadoopConfiguration())
> .values()
> .map(Text::toString);
>
>
> Looking at the CSVFileFormat.scala class, it looks like in the private
> readText method, if I could overwrite where it passes TextInputFormat to the
> hadoopFile method with my custom format, I would be able to achieve what I
> want.
>
> private def readText(
> sparkSession: SparkSession,
> options: CSVOptions,
> location: String): RDD[String] = {
>   if (Charset.forName(options.charset) == StandardCharsets.UTF_8) {
> sparkSession.sparkContext.textFile(location)
>   } else {
> val charset = options.charset
> sparkSession.sparkContext
>   // This is where I would want to be able to specify my
>   // input format instead of TextInputFormat
>   .hadoopFile[LongWritable, Text, TextInputFormat](location)
>   .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0,
> pair._2.getLength, charset)))
>   }
> }
>
>
> Does anyone know if there is another way to prevent the corrupt files from
> failing my job, or could anyone help make the required changes to make the
> TextInputFormat customizable, as I have only just started looking at Scala?
>
> Thanks,
> Nathan
>
>


Re: Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
Thank you, that's what I wanted to confirm.

2017-03-16 

lk_spark 



From: Yuhao Yang 
Sent: 2017-03-16 13:05
Subject: Re: Re: how to call recommend method from ml.recommendation.ALS
To: "lk_spark"
Cc: "任弘迪","user.spark"

This is something that was just added to ML and will probably be released with 
2.2. For now you can try to copy from the master code: 
https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352
 and give it a try.


Yuhao


2017-03-15 21:39 GMT-07:00 lk_spark :

thanks for your reply , what I exactly want to know is :
in package mllib.recommendation  , MatrixFactorizationModel have method like 
recommendProducts , but I didn't find it in package ml.recommendation.
how can I do the samething as mllib when I use ml. 
2017-03-16 

lk_spark 



From: 任弘迪 
Sent: 2017-03-16 10:46
Subject: Re: how to call recommend method from ml.recommendation.ALS
To: "lk_spark"
Cc: "user.spark"

if the num of user-item pairs to predict aren't too large, say millions, you 
could transform the target dataframe and save the result to a hive table, then 
build cache based on that table for online services. 


if it's not the case(such as billions of user item pairs to predict), you have 
to start a service with the model loaded, send user to the service, first match 
several hundreds of items from all items available which could itself be 
another service or cache, then transform this user and all items using the 
model to get prediction, and return items ordered by prediction.


On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:

hi,all:
   under spark2.0 ,I wonder to know after trained a 
ml.recommendation.ALSModel how I can do the recommend action?

   I try to save the model and load it by MatrixFactorizationModel but got 
error.

2017-03-16


lk_spark 

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Rohit Karlupia
The number of tasks is very likely not the reason for getting timeouts. A few
things to look for:

What is actually timing out? What kind of operation?
Writing/reading to HDFS (NameNode or DataNode),
or fetching shuffle data (External Shuffle Service or not),
or the driver is not able to talk to an executor.

The trivial thing to do is to increase the network timeouts in the Spark conf
(see the sketch below). The other thing to check is whether GC is kicking in --
if so, try increasing the heap size.
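For example, a hedged illustration of the timeout knobs (these are the generic
network properties; the values are examples only, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.network.timeout", "600s")            // default is 120s
  .set("spark.executor.heartbeatInterval", "60s")  // keep this below spark.network.timeout
val sc = new SparkContext(conf)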

thanks & all the best,
rohitk



On Thu, Mar 16, 2017 at 7:23 AM, Yong Zhang  wrote:

> Not really sure what is the root problem you try to address.
>
>
> The number of tasks that need to be run in Spark depends on the number of
> partitions in your job.
>
>
> Let's use a simple word count example: if your Spark job reads 128G of data
> from HDFS (assume the default block size is 128M), then the mapper stage of
> your Spark job will spawn 1000 tasks (128G / 128M).
>
>
> In the reducer stage, by default, spark will spawn 200 tasks (controlled
> by spark.default.parallelism if you are using RDD api or
> spark.sql.shuffle.partitions if you are using DataFrame, and you didn't
> specify the partition number in any of your API call).
>
>
> In either case, you can change the number of tasks spawned (even in the
> mapper case, though I don't see any reason to under normal circumstances).
> For huge datasets running in Spark, people often increase the task count
> spawned in the reducing stage, to make each task process a much smaller
> volume of data, reducing the memory pressure and increasing performance.
>
>
> Still in the word count example, if you have 2000 unique words in your
> dataset, then your reducer count could be from 1 to 2000. 1 is the worst,
> as only one task will process all 2000 unique words, meaning all the data
> will be sent to this one task, and it will be the slowest. But on the other
> hand, 2000 may not be the best either.
>
>
> Let's say 200 is the best number, so you will have 200 reduce tasks
> to process the 2000 unique words. Setting the number of executors and cores
> just controls how many of these tasks can run concurrently. So if your
> cluster has enough cores and memory available, then obviously granting as
> many cores as possible, up to 200, to your Spark job for this reducing stage is best.
>
>
> You need to be clearer about what problem you are facing when running
> your Spark job here, so we can provide help. Reducing the number of tasks
> spawned is normally a very strange approach.
>
>
> Yong
>
>
> --
> *From:* Kevin Peng 
> *Sent:* Wednesday, March 15, 2017 1:35 PM
> *To:* mohini kalamkar
> *Cc:* user@spark.apache.org
> *Subject:* Re: Setting Optimal Number of Spark Executor Instances
>
> Mohini,
>
> We set that parameter before we went and played with the number of
> executors and that didn't seem to help at all.
>
> Thanks,
>
> KP
>
> On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar <
> mohini.kalam...@gmail.com> wrote:
>
>> Hi,
>>
>> try using this parameter --conf spark.sql.shuffle.partitions=1000
>>
>> Thanks,
>> Mohini
>>
>> On Tue, Mar 14, 2017 at 3:30 PM, kpeng1  wrote:
>>
>>> Hi All,
>>>
>>> I am currently on Spark 1.6 and I was doing a SQL join on two tables that
>>> are over 100 million rows each, and I noticed that it was spawning 3+
>>> tasks
>>> (this is the progress meter that we are seeing show up).  We tried to
>>> coalesce, repartition and shuffle partitions to drop the number of tasks
>>> down because we were getting timeouts due to the number of tasks being
>>> spawned, but those operations did not seem to reduce the number of tasks.
>>> The solution we came up with was actually to set the num executors to 50
>>> (--num-executors=50) and it looks like it spawned 200 active tasks, but
>>> the
>>> total number of tasks remained the same.  Was wondering if anyone knows
>>> what
>>> is going on?  Is there an optimal number of executors, I was under the
>>> impression that the default dynamic allocation would pick the optimal
>>> number
>>> of executors for us and that this situation wouldn't happen.  Is there
>>> something I am missing?
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Setting-Optimal-Number-of-Spark-Execut
>>> or-Instances-tp28493.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Mohini Kalamkar
>> M: +1 310 567 9329
>>
>
>


Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread Yuhao Yang
This is something that was just added to ML and will probably be released
with 2.2. For now you can try to copy from the master code:
https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352
and give it a try.
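For reference, a rough sketch of what that API looks like on the master branch
(the recommendForAllUsers/recommendForAllItems names are taken from the linked
code and were not part of a released version at the time of this thread;
"ratings" is an assumed DataFrame with the columns named below):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
val model = als.fit(ratings)

val top10ItemsPerUser = model.recommendForAllUsers(10)  // top 10 items for every user
val top10UsersPerItem = model.recommendForAllItems(10)  // top 10 users for every item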

Yuhao

2017-03-15 21:39 GMT-07:00 lk_spark :

> Thanks for your reply. What I exactly want to know is:
> in package mllib.recommendation, MatrixFactorizationModel has methods
> like recommendProducts, but I didn't find them in package ml.recommendation.
> How can I do the same thing as in mllib when I use ml?
> 2017-03-16
> --
> lk_spark
> --
>
> *From:* 任弘迪 
> *Sent:* 2017-03-16 10:46
> *Subject:* Re: how to call recommend method from ml.recommendation.ALS
> *To:* "lk_spark"
> *Cc:* "user.spark"
>
> if the num of user-item pairs to predict aren't too large, say millions,
> you could transform the target dataframe and save the result to a hive
> table, then build cache based on that table for online services.
>
> if it's not the case(such as billions of user item pairs to predict), you
> have to start a service with the model loaded, send user to the service,
> first match several hundreds of items from all items available which could
> itself be another service or cache, then transform this user and all items
> using the model to get prediction, and return items ordered by prediction.
>
> On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:
>
>> hi,all:
>>under spark2.0 ,I wonder to know after trained a
>> ml.recommendation.ALSModel how I can do the recommend action?
>>
>>I try to save the model and load it by MatrixFactorizationModel
>> but got error.
>>
>> 2017-03-16
>> --
>> lk_spark
>>
>
>


Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
Thanks for your reply. What I exactly want to know is:
in package mllib.recommendation, MatrixFactorizationModel has methods like
recommendProducts, but I didn't find them in package ml.recommendation.
How can I do the same thing as in mllib when I use ml?
2017-03-16 

lk_spark 



From: 任弘迪 
Sent: 2017-03-16 10:46
Subject: Re: how to call recommend method from ml.recommendation.ALS
To: "lk_spark"
Cc: "user.spark"

If the number of user-item pairs to predict isn't too large, say millions, you
could transform the target dataframe and save the result to a Hive table, then
build a cache based on that table for online services.


If that's not the case (such as billions of user-item pairs to predict), you have
to start a service with the model loaded, send the user to the service, first match
several hundreds of items from all items available (which could itself be
another service or cache), then transform this user and all items using the
model to get predictions, and return items ordered by prediction.


On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:

hi,all:
   under spark2.0 ,I wonder to know after trained a 
ml.recommendation.ALSModel how I can do the recommend action?

   I try to save the model and load it by MatrixFactorizationModel but got 
error.

2017-03-16


lk_spark 

RE: Fast write datastore...

2017-03-15 Thread jasbir.sing
Hi,

Will MongoDB not fit this solution?



From: Vova Shelgunov [mailto:vvs...@gmail.com]
Sent: Wednesday, March 15, 2017 11:51 PM
To: Muthu Jayakumar 
Cc: vincent gromakowski ; Richard Siebeling 
; user ; Shiva Ramagopal 

Subject: Re: Fast write datastore...

Hi Muthu,

I did not catch from your message: what performance do you expect from
subsequent queries?

Regards,
Uladzimir

On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" 
> wrote:
Hello Uladzimir / Shiva,

From ElasticSearch documentation (i have to see the logical plan of a query to 
confirm), the richness of filters (like regex,..) is pretty good while 
comparing to Cassandra. As for aggregates, i think Spark Dataframes is quite 
rich enough to tackle.
Let me know your thoughts.

Thanks,
Muthu


On Wed, Mar 15, 2017 at 10:55 AM, vvshvv 
> wrote:
Hi muthu,

I agree with Shiva, Cassandra also supports SASI indexes, which can partially 
replace Elasticsearch functionality.

Regards,
Uladzimir



Sent from my Mi phone
On Shiva Ramagopal >, Mar 15, 2017 
5:57 PM wrote:
Probably Cassandra is a good choice if you are mainly looking for a datastore 
that supports fast writes. You can ingest the data into a table and define one 
or more materialized views on top of it to support your queries. Since you 
mention that your queries are going to be simple you can define your indexes in 
the materialized views according to how you want to query the data.
Thanks,
Shiva


On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
> wrote:
Hello Vincent,

Cassandra may not fit my bill if I need to define my partition and other 
indexes upfront. Is this right?

Hello Richard,

Let me evaluate Apache Ignite. I did evaluate it 3 months back and back then 
the connector to Apache Spark did not support Spark 2.0.

Another drastic thought may be repartition the result count to 1 (but have to 
be cautions on making sure I don't run into Heap issues if the result is too 
large to fit into an executor)  and write to a relational database like mysql / 
postgres. But, I believe I can do the same using ElasticSearch too.

A slightly over-kill solution may be Spark to Kafka to ElasticSearch?

More thoughts welcome please.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling 
> wrote:
maybe Apache Ignite does fit your requirements

On 15 March 2017 at 08:44, vincent gromakowski 
> wrote:
Hi
If queries are statics and filters are on the same columns, Cassandra is a good 
option.

On 15 Mar 2017 7:04 AM, "muthu" 
> wrote:
Hello there,

I have one or more parquet files to read and perform some aggregate queries
using Spark Dataframe. I would like to find a reasonable fast datastore that
allows me to write the results for subsequent (simpler queries).
I did attempt to use ElasticSearch to write the query results using
ElasticSearch Hadoop connector. But I am running into connector write issues
if the number of Spark executors are too many for ElasticSearch to handle.
But in the schema sense, this seems a great fit as ElasticSearch has smartz
in place to discover the schema. Also in the query sense, I can perform
simple filters and sort using ElasticSearch and for more complex aggregate,
Spark Dataframe can come back to the rescue :).
Please advice on other possible data-stores I could use?

Thanks,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org









Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
>Reading your original question again, it seems to me probably you don't
need a fast data store
Shiva, you are right. I only asked about fast writes and never mentioned
reads :). For us, Cassandra may not be a choice for reads because of:
a. limitations on pagination support on the server side
b. the richness of filters provided when compared to Elasticsearch... but this
can be worked around by using Spark DataFrames.
c. a possibly larger limitation for me, which is the mandate to create a
partition key column beforehand. I may not be able to determine this
beforehand.
But 'materialized view' and 'SSTable Attached Secondary Index (SASI)' can help
alleviate this to some extent.

>what performance do you expect from subsequent queries?
Uladzimir, here is what we do now...
Step 1: Run aggregate query using large number of parquets (generally
ranging from few MBs to few GBs) using Spark Dataframe.
Step 2: Attempt to store these query results in a 'fast datastore' (I have
asked for recommendations in this post). The data is usually sized from
250K to 600 million rows... Also the schema from Step 1 is not known beforehand
and is usually deduced from the DataFrame schema or so. In most cases
it's a simple non-structural field.
Step 3: Run one or more queries from results stored in Step 2... These are
something as simple as pagination, filters (think of it as simple string
contains, regex, number in range, ...) and sort. For any operation more
complex than this, I have been planning to run it thru a dataframe.

Koert makes valid points on the issues with Elastic Search.

On a side note, we do use Cassandra for Spark Streaming use-cases where we
sink the data into Cassandra (for efficient upsert capabilities) and
eventually write into parquet for long term storage and trend analysis with
full table scans scenarios.

But I am thankful for many ideas and perspectives on how this could be
looked at.

Thanks,
Muthu


On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal  wrote:

> Hi,
>
> The choice of ES vs Cassandra should really be made depending on your
> query use-cases. ES and Cassandra have their own strengths which should be
> matched to what you want to do rather than making a choice based on their
> respective feature sets.
>
> Reading your original question again, it seems to me probably you don't
> need a fast data store since you are doing a batch-like processing (reading
> from Parquet files) and it is possibly to control this part fully. And it
> also seems like you want to use ES. You can try to reduce the number of
> Spark executors to throttle the writes to ES.
>
> -Shiva
>
> On Wed, Mar 15, 2017 at 11:32 PM, Muthu Jayakumar 
> wrote:
>
>> Hello Uladzimir / Shiva,
>>
>> From ElasticSearch documentation (i have to see the logical plan of a
>> query to confirm), the richness of filters (like regex,..) is pretty good
>> while comparing to Cassandra. As for aggregates, i think Spark Dataframes
>> is quite rich enough to tackle.
>> Let me know your thoughts.
>>
>> Thanks,
>> Muthu
>>
>>
>> On Wed, Mar 15, 2017 at 10:55 AM, vvshvv  wrote:
>>
>>> Hi muthu,
>>>
>>> I agree with Shiva, Cassandra also supports SASI indexes, which can
>>> partially replace Elasticsearch functionality.
>>>
>>> Regards,
>>> Uladzimir
>>>
>>>
>>>
>>> Sent from my Mi phone
>>> On Shiva Ramagopal , Mar 15, 2017 5:57 PM wrote:
>>>
>>> Probably Cassandra is a good choice if you are mainly looking for a
>>> datastore that supports fast writes. You can ingest the data into a table
>>> and define one or more materialized views on top of it to support your
>>> queries. Since you mention that your queries are going to be simple you can
>>> define your indexes in the materialized views according to how you want to
>>> query the data.
>>>
>>> Thanks,
>>> Shiva
>>>
>>>
>>>
>>> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
>>> wrote:
>>>
 Hello Vincent,

 Cassandra may not fit my bill if I need to define my partition and
 other indexes upfront. Is this right?

 Hello Richard,

 Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
 then the connector to Apache Spark did not support Spark 2.0.

 Another drastic thought may be repartition the result count to 1 (but
 have to be cautions on making sure I don't run into Heap issues if the
 result is too large to fit into an executor)  and write to a relational
 database like mysql / postgres. But, I believe I can do the same using
 ElasticSearch too.

 A slightly over-kill solution may be Spark to Kafka to ElasticSearch?

 More thoughts welcome please.

 Thanks,
 Muthu

 On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <
 rsiebel...@gmail.com> wrote:

> maybe Apache Ignite does fit your requirements
>
> On 15 March 2017 at 08:44, vincent gromakowski <
> 

Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread 任弘迪
If the number of user-item pairs to predict isn't too large, say millions,
you could transform the target dataframe and save the result to a Hive
table, then build a cache based on that table for online services.

If that's not the case (such as billions of user-item pairs to predict), you
have to start a service with the model loaded, send the user to the service,
first match several hundreds of items from all items available (which could
itself be another service or cache), then transform this user and all items
using the model to get predictions, and return items ordered by prediction.
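A minimal sketch of that first approach (the column names, the model variable
and the table names are assumptions for illustration):

// alsModel: a trained org.apache.spark.ml.recommendation.ALSModel
val pairsToScore = spark.table("candidate_user_item_pairs") // userId, itemId pairs to predict

val scored = alsModel.transform(pairsToScore)               // adds a "prediction" column

scored.write
  .mode("overwrite")
  .saveAsTable("recommendation_scores")                     // build the online cache from this table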

On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:

> hi, all:
>Under Spark 2.0, I want to know how I can do the recommend action
> after training an ml.recommendation.ALSModel.
>
>I tried to save the model and load it with MatrixFactorizationModel but
> got an error.
>
> 2017-03-16
> --
> lk_spark
>


Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal
Hi,

The choice of ES vs Cassandra should really be made depending on your query
use-cases. ES and Cassandra have their own strengths which should be
matched to what you want to do rather than making a choice based on their
respective feature sets.

Reading your original question again, it seems to me that you probably don't
need a fast data store, since you are doing batch-like processing (reading
from Parquet files) and it is possible to control this part fully. And it
also seems like you want to use ES. You can try to reduce the number of
Spark executors to throttle the writes to ES.

-Shiva

On Wed, Mar 15, 2017 at 11:32 PM, Muthu Jayakumar 
wrote:

> Hello Uladzimir / Shiva,
>
> From the ElasticSearch documentation (I have to see the logical plan of a
> query to confirm), the richness of filters (like regex, ...) is pretty good
> compared to Cassandra. As for aggregates, I think Spark DataFrames
> are rich enough to tackle them.
> Let me know your thoughts.
>
> Thanks,
> Muthu
>
>
> On Wed, Mar 15, 2017 at 10:55 AM, vvshvv  wrote:
>
>> Hi muthu,
>>
>> I agree with Shiva, Cassandra also supports SASI indexes, which can
>> partially replace Elasticsearch functionality.
>>
>> Regards,
>> Uladzimir
>>
>>
>>
>> Sent from my Mi phone
>> On Shiva Ramagopal , Mar 15, 2017 5:57 PM wrote:
>>
>> Probably Cassandra is a good choice if you are mainly looking for a
>> datastore that supports fast writes. You can ingest the data into a table
>> and define one or more materialized views on top of it to support your
>> queries. Since you mention that your queries are going to be simple you can
>> define your indexes in the materialized views according to how you want to
>> query the data.
>>
>> Thanks,
>> Shiva
>>
>>
>>
>> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
>> wrote:
>>
>>> Hello Vincent,
>>>
>>> Cassandra may not fit my bill if I need to define my partition and other
>>> indexes upfront. Is this right?
>>>
>>> Hello Richard,
>>>
>>> Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
>>> then the connector to Apache Spark did not support Spark 2.0.
>>>
>>> Another drastic thought may be repartition the result count to 1 (but
>>> have to be cautions on making sure I don't run into Heap issues if the
>>> result is too large to fit into an executor)  and write to a relational
>>> database like mysql / postgres. But, I believe I can do the same using
>>> ElasticSearch too.
>>>
>>> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>>>
>>> More thoughts welcome please.
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling >> > wrote:
>>>
 maybe Apache Ignite does fit your requirements

 On 15 March 2017 at 08:44, vincent gromakowski <
 vincent.gromakow...@gmail.com> wrote:

> Hi
> If queries are statics and filters are on the same columns, Cassandra
> is a good option.
>
> On 15 Mar 2017 7:04 AM, "muthu"  wrote:
>
> Hello there,
>
> I have one or more parquet files to read and perform some aggregate
> queries
> using Spark Dataframe. I would like to find a reasonable fast
> datastore that
> allows me to write the results for subsequent (simpler queries).
> I did attempt to use ElasticSearch to write the query results using
> ElasticSearch Hadoop connector. But I am running into connector write
> issues
> if the number of Spark executors are too many for ElasticSearch to
> handle.
> But in the schema sense, this seems a great fit as ElasticSearch has
> smartz
> in place to discover the schema. Also in the query sense, I can perform
> simple filters and sort using ElasticSearch and for more complex
> aggregate,
> Spark Dataframe can come back to the rescue :).
> Please advice on other possible data-stores I could use?
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>

>>>
>>
>


Re: Fast write datastore...

2017-03-15 Thread Koert Kuipers
we are using elasticsearch for this.

the issue of elasticsearch falling over if the number of partitions/cores
in spark writing to it is too high does suck indeed. and the answer every
time i asked about it on elasticsearch mailing list has been to reduce
spark tasks or increase elasticsearch nodes, which is not very useful.

we ended up putting the spark jobs that write to elasticsearch on a yarn
queue that limits cores. not ideal but it does the job.

On Wed, Mar 15, 2017 at 2:04 AM, muthu  wrote:

> Hello there,
>
> I have one or more parquet files to read and perform some aggregate queries
> using Spark Dataframe. I would like to find a reasonably fast datastore
> that
> allows me to write the results for subsequent (simpler queries).
> I did attempt to use ElasticSearch to write the query results using
> ElasticSearch Hadoop connector. But I am running into connector write
> issues
> if the number of Spark executors are too many for ElasticSearch to handle.
> But in the schema sense, this seems a great fit as ElasticSearch has smartz
> in place to discover the schema. Also in the query sense, I can perform
> simple filters and sort using ElasticSearch and for more complex aggregate,
> Spark Dataframe can come back to the rescue :).
> Please advise on other possible data-stores I could use?
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Yong Zhang
Is the answer here good for your case?


http://stackoverflow.com/questions/33151866/spark-udf-with-varargs

"Spark UDF with varargs" (Stack Overflow): UDFs don't support varargs, but you
can pass an arbitrary number of columns wrapped using an array function.
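A minimal sketch of that approach (the column names, the numeric types and the
sum logic are illustrative only; "df" is an assumed DataFrame):

import org.apache.spark.sql.functions.{array, col, udf}

val colNames = Seq("a", "b", "c")                  // user-supplied column names
val sumAll = udf((xs: Seq[Double]) => xs.sum)      // UDF over the wrapped columns

val result = df.withColumn("total", sumAll(array(colNames.map(col): _*)))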






From: anup ahire 
Sent: Wednesday, March 15, 2017 2:04 AM
To: user@spark.apache.org
Subject: apply UDFs to N columns dynamically in dataframe

Hello,

I have a schema and name of columns to apply UDF to. Name of columns are user 
input and they can differ in numbers for each input.

Is there a way to apply UDFs to N columns in dataframe  ?



Thanks !


Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Yong Zhang
Not really sure what is the root problem you try to address.


The number of tasks that need to be run in Spark depends on the number of partitions
in your job.


Let's use a simple word count example: if your Spark job reads 128G of data from
HDFS (assume the default block size is 128M), then the mapper stage of your
Spark job will spawn 1000 tasks (128G / 128M).


In the reducer stage, by default, Spark will spawn 200 tasks (controlled by
spark.default.parallelism if you are using the RDD API, or
spark.sql.shuffle.partitions if you are using DataFrames, and you didn't specify
the partition number in any of your API calls); see the sketch below.
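A hedged illustration of those two knobs (the numbers are examples only, and
"lines" is an assumed RDD[String]):

// DataFrame/SQL shuffles: on 1.6 use sqlContext.setConf, on 2.x spark.conf.set
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

// RDD API: pass an explicit partition count to the shuffle operation itself
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1000)   // 1000 reduce tasks instead of spark.default.parallelism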


In either case, you can change the number of tasks spawned (even in the mapper
case, though I don't see any reason to under normal circumstances). For huge
datasets running in Spark, people often increase the task count spawned in the
reducing stage, to make each task process a much smaller volume of data,
reducing the memory pressure and increasing performance.


Still in the word count example, if you have 2000 unique words in your dataset,
then your reducer count could be from 1 to 2000. 1 is the worst, as only one
task will process all 2000 unique words, meaning all the data will be sent to
this one task, and it will be the slowest. But on the other hand, 2000 may not
be the best either.


Let's say 200 is the best number, so you will have 200 reduce tasks to
process the 2000 unique words. Setting the number of executors and cores just
controls how many of these tasks can run concurrently. So if your cluster has
enough cores and memory available, then obviously granting as many cores as
possible, up to 200, to your Spark job for this reducing stage is best.


You need to be clearer about what problem you are facing when running your
Spark job here, so we can provide help. Reducing the number of tasks spawned
is normally a very strange approach.


Yong



From: Kevin Peng 
Sent: Wednesday, March 15, 2017 1:35 PM
To: mohini kalamkar
Cc: user@spark.apache.org
Subject: Re: Setting Optimal Number of Spark Executor Instances

Mohini,

We set that parameter before we went and played with the number of executors 
and that didn't seem to help at all.

Thanks,

KP

On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar 
> wrote:
Hi,

try using this parameter --conf spark.sql.shuffle.partitions=1000

Thanks,
Mohini

On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 
> wrote:
Hi All,

I am currently on Spark 1.6 and I was doing a SQL join on two tables that
are over 100 million rows each, and I noticed that it was spawning 3+ tasks
(this is the progress meter that we are seeing show up).  We tried to
coalesce, repartition and shuffle partitions to drop the number of tasks
down because we were getting timeouts due to the number of tasks being
spawned, but those operations did not seem to reduce the number of tasks.
The solution we came up with was actually to set the num executors to 50
(--num-executors=50) and it looks like it spawned 200 active tasks, but the
total number of tasks remained the same.  Was wondering if anyone knows what
is going on?  Is there an optimal number of executors, I was under the
impression that the default dynamic allocation would pick the optimal number
of executors for us and that this situation wouldn't happen.  Is there
something I am missing?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Optimal-Number-of-Spark-Executor-Instances-tp28493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




--
Thanks & Regards,
Mohini Kalamkar
M: +1 310 567 9329



how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread lk_spark
hi, all:
   Under Spark 2.0, I want to know how I can do the recommend action after
training an ml.recommendation.ALSModel.

   I tried to save the model and load it with MatrixFactorizationModel but got
an error.

2017-03-16


lk_spark 

Re: Fast write datastore...

2017-03-15 Thread Vova Shelgunov
Hi Muthu,

I did not catch from your message: what performance do you expect from
subsequent queries?

Regards,
Uladzimir

On Mar 15, 2017 9:03 PM, "Muthu Jayakumar"  wrote:

> Hello Uladzimir / Shiva,
>
> From the ElasticSearch documentation (I have to see the logical plan of a
> query to confirm), the richness of filters (like regex, ...) is pretty good
> compared to Cassandra. As for aggregates, I think Spark DataFrames
> are rich enough to tackle them.
> Let me know your thoughts.
>
> Thanks,
> Muthu
>
>
> On Wed, Mar 15, 2017 at 10:55 AM, vvshvv  wrote:
>
>> Hi muthu,
>>
>> I agree with Shiva, Cassandra also supports SASI indexes, which can
>> partially replace Elasticsearch functionality.
>>
>> Regards,
>> Uladzimir
>>
>>
>>
>> Sent from my Mi phone
>> On Shiva Ramagopal , Mar 15, 2017 5:57 PM wrote:
>>
>> Probably Cassandra is a good choice if you are mainly looking for a
>> datastore that supports fast writes. You can ingest the data into a table
>> and define one or more materialized views on top of it to support your
>> queries. Since you mention that your queries are going to be simple you can
>> define your indexes in the materialized views according to how you want to
>> query the data.
>>
>> Thanks,
>> Shiva
>>
>>
>>
>> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
>> wrote:
>>
>>> Hello Vincent,
>>>
>>> Cassandra may not fit my bill if I need to define my partition and other
>>> indexes upfront. Is this right?
>>>
>>> Hello Richard,
>>>
>>> Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
>>> then the connector to Apache Spark did not support Spark 2.0.
>>>
>>> Another drastic thought may be repartition the result count to 1 (but
>>> have to be cautions on making sure I don't run into Heap issues if the
>>> result is too large to fit into an executor)  and write to a relational
>>> database like mysql / postgres. But, I believe I can do the same using
>>> ElasticSearch too.
>>>
>>> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>>>
>>> More thoughts welcome please.
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling >> > wrote:
>>>
 maybe Apache Ignite does fit your requirements

 On 15 March 2017 at 08:44, vincent gromakowski <
 vincent.gromakow...@gmail.com> wrote:

> Hi
> If queries are statics and filters are on the same columns, Cassandra
> is a good option.
>
> On 15 Mar 2017 7:04 AM, "muthu"  wrote:
>
> Hello there,
>
> I have one or more parquet files to read and perform some aggregate
> queries
> using Spark Dataframe. I would like to find a reasonable fast
> datastore that
> allows me to write the results for subsequent (simpler queries).
> I did attempt to use ElasticSearch to write the query results using
> ElasticSearch Hadoop connector. But I am running into connector write
> issues
> if the number of Spark executors are too many for ElasticSearch to
> handle.
> But in the schema sense, this seems a great fit as ElasticSearch has
> smartz
> in place to discover the schema. Also in the query sense, I can perform
> simple filters and sort using ElasticSearch and for more complex
> aggregate,
> Spark Dataframe can come back to the rescue :).
> Please advice on other possible data-stores I could use?
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>

>>>
>>
>


Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
Hello Uladzimir / Shiva,

From the ElasticSearch documentation (I have to see the logical plan of a query
to confirm), the richness of filters (like regex, ...) is pretty good
compared to Cassandra. As for aggregates, I think Spark DataFrames are
rich enough to tackle them.
Let me know your thoughts.

Thanks,
Muthu


On Wed, Mar 15, 2017 at 10:55 AM, vvshvv  wrote:

> Hi muthu,
>
> I agree with Shiva, Cassandra also supports SASI indexes, which can
> partially replace Elasticsearch functionality.
>
> Regards,
> Uladzimir
>
>
>
> Sent from my Mi phone
> On Shiva Ramagopal , Mar 15, 2017 5:57 PM wrote:
>
> Probably Cassandra is a good choice if you are mainly looking for a
> datastore that supports fast writes. You can ingest the data into a table
> and define one or more materialized views on top of it to support your
> queries. Since you mention that your queries are going to be simple you can
> define your indexes in the materialized views according to how you want to
> query the data.
>
> Thanks,
> Shiva
>
>
>
> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
> wrote:
>
>> Hello Vincent,
>>
>> Cassandra may not fit my bill if I need to define my partition and other
>> indexes upfront. Is this right?
>>
>> Hello Richard,
>>
>> Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
>> then the connector to Apache Spark did not support Spark 2.0.
>>
>> Another drastic thought may be repartition the result count to 1 (but
>> have to be cautions on making sure I don't run into Heap issues if the
>> result is too large to fit into an executor)  and write to a relational
>> database like mysql / postgres. But, I believe I can do the same using
>> ElasticSearch too.
>>
>> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>>
>> More thoughts welcome please.
>>
>> Thanks,
>> Muthu
>>
>> On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling 
>> wrote:
>>
>>> maybe Apache Ignite does fit your requirements
>>>
>>> On 15 March 2017 at 08:44, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
 Hi
 If queries are statics and filters are on the same columns, Cassandra
 is a good option.

 On 15 Mar 2017 7:04 AM, "muthu"  wrote:

 Hello there,

 I have one or more parquet files to read and perform some aggregate
 queries
 using Spark Dataframe. I would like to find a reasonable fast datastore
 that
 allows me to write the results for subsequent (simpler queries).
 I did attempt to use ElasticSearch to write the query results using
 ElasticSearch Hadoop connector. But I am running into connector write
 issues
 if the number of Spark executors are too many for ElasticSearch to
 handle.
 But in the schema sense, this seems a great fit as ElasticSearch has
 smartz
 in place to discover the schema. Also in the query sense, I can perform
 simple filters and sort using ElasticSearch and for more complex
 aggregate,
 Spark Dataframe can come back to the rescue :).
 Please advice on other possible data-stores I could use?

 Thanks,
 Muthu



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org



>>>
>>
>


Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini,

We set that parameter before we went and played with the number of
executors and that didn't seem to help at all.

Thanks,

KP

On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar 
wrote:

> Hi,
>
> try using this parameter --conf spark.sql.shuffle.partitions=1000
>
> Thanks,
> Mohini
>
> On Tue, Mar 14, 2017 at 3:30 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I am currently on Spark 1.6 and I was doing a SQL join on two tables that
>> are over 100 million rows each, and I noticed that it was spawning 3+
>> tasks
>> (this is the progress meter that we are seeing show up).  We tried to
>> coalesce, repartition and shuffle partitions to drop the number of tasks
>> down because we were getting timeouts due to the number of tasks being
>> spawned, but those operations did not seem to reduce the number of tasks.
>> The solution we came up with was actually to set the num executors to 50
>> (--num-executors=50) and it looks like it spawned 200 active tasks, but
>> the
>> total number of tasks remained the same.  Was wondering if anyone knows
>> what
>> is going on?  Is there an optimal number of executors, I was under the
>> impression that the default dynamic allocation would pick the optimal
>> number
>> of executors for us and that this situation wouldn't happen.  Is there
>> something I am missing?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Setting-Optimal-Number-of-Spark-Execut
>> or-Instances-tp28493.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Thanks & Regards,
> Mohini Kalamkar
> M: +1 310 567 9329
>


Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Jörn Franke
Hi,

The Spark CSV parser has different parsing modes:
* permissive (default) tries to read everything; missing tokens are
interpreted as null and extra tokens are ignored
* dropmalformed drops lines which have more or fewer tokens than expected
* failfast - RuntimeException if there is a malformed line
Obviously this does not handle malformed gzip (you may ask the sender of the gzip
to improve their application).
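For example, a hedged sketch of selecting one of those modes with the built-in
reader (the path and header option are placeholders):

val df = spark.read
  .option("mode", "DROPMALFORMED")   // or "PERMISSIVE" / "FAILFAST"
  .option("header", "true")
  .csv("/path/to/input")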

You can adapt the line you mentioned (not sure which Spark version this is),
but you may not want to do that, because it would mean maintaining your own
Spark version.

You can write your own data source (i.e. in a different namespace than Spark CSV).
Then you can also think about a lot of optimisations compared to the Spark CSV
parser, which - depending on the CSV and your analysis needs - can make
processing much faster.

You could also add a new compression codec that ignores broken gzips. In that
case you would not need your own data source.

Best regards

> On 15 Mar 2017, at 16:56, Nathan Case  wrote:
> 
> Accidentally sent this to the dev mailing list, meant to send it here. 
> 
> I have a spark java application that in the past has used the hadoopFile 
> interface to specify a custom TextInputFormat to be used when reading files.  
> This custom class would gracefully handle exceptions like EOF exceptions 
> caused by corrupt gzip files in the input data.  I recently switched to using 
> the csv parser built into spark but am now faced with the problem that 
> anytime a bad input file is encountered my whole job fails.
> 
> My code to load the data using csv is:
> Dataset<Row> csv = sparkSession.read()
> .option("delimiter", parseSettings.getDelimiter().toString())
> .option("quote", parseSettings.getQuote())
> .option("parserLib", "UNIVOCITY")
> .csv(paths);
> Previously I would load the data using:
> JavaRDD<String> lines = sc.newAPIHadoopFile(filePaths, NewInputFormat.class,
> LongWritable.class, Text.class, sc.hadoopConfiguration())
> .values()
> .map(Text::toString);
> 
> Looking at the CSVFileFormat.scala class, it looks like in the private
> readText method, if I could overwrite where it passes TextInputFormat to the
> hadoopFile method with my custom format, I would be able to achieve what I
> want.
> 
> private def readText(
> sparkSession: SparkSession,
> options: CSVOptions,
> location: String): RDD[String] = {
>   if (Charset.forName(options.charset) == StandardCharsets.UTF_8) {
> sparkSession.sparkContext.textFile(location)
>   } else {
> val charset = options.charset
> sparkSession.sparkContext
>// This is where I would want to be able to specify my
>// input format instead of TextInputFormat
>   .hadoopFile[LongWritable, Text, TextInputFormat](location)
>   .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, 
> pair._2.getLength, charset)))
>   }
> }
> 
> Does anyone know if there is another way to prevent the corrupt files from
> failing my job, or could anyone help make the required changes to make the
> TextInputFormat customizable, as I have only just started looking at Scala?
> 
> Thanks,
> Nathan
> 


[Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Nathan Case
Accidentally sent this to the dev mailing list, meant to send it here.

I have a spark java application that in the past has used the hadoopFile
interface to specify a custom TextInputFormat to be used when reading
files.  This custom class would gracefully handle exceptions like EOF
exceptions caused by corrupt gzip files in the input data.  I recently
switched to using the csv parser built into spark but am now faced with the
problem that anytime a bad input file is encountered my whole job fails.

My code to load the data using csv is:

Dataset<Row> csv = sparkSession.read()
.option("delimiter", parseSettings.getDelimiter().toString())
.option("quote", parseSettings.getQuote())
.option("parserLib", "UNIVOCITY")
.csv(paths);

Previously I would load the data using:

JavaRDD<String> lines = sc.newAPIHadoopFile(filePaths, NewInputFormat.class,
LongWritable.class, Text.class, sc.hadoopConfiguration())
.values()
.map(Text::toString);


Looking at the CSVFileFormat.scala class, it looks like in the private
readText method, if I could overwrite where it passes TextInputFormat to the
hadoopFile method with my custom format, I would be able to achieve what I
want.

private def readText(
sparkSession: SparkSession,
options: CSVOptions,
location: String): RDD[String] = {
  if (Charset.forName(options.charset) == StandardCharsets.UTF_8) {
sparkSession.sparkContext.textFile(location)
  } else {
val charset = options.charset
sparkSession.sparkContext
  // This is where I would want to be able to specify my
  // input format instead of TextInputFormat
  .hadoopFile[LongWritable, Text, TextInputFormat](location)
  .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0,
pair._2.getLength, charset)))
  }
}


Does anyone know if there is another way to prevent the corrupt files from
failing my job, or could anyone help make the required changes to make the
TextInputFormat customizable, as I have only just started looking at Scala?

Thanks,
Nathan


Re: Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Vikash Pareek
++ Sudhir

On Wed, 15 Mar 2017 at 4:06 PM, Saurav Sinha 
wrote:

> Hi Users,
>
>
> I am running spark job in yarn.
>
> I want to set staging directory to some other location which is by default
> hdfs://host:port/home/$User/
>
> In spark 2.0.0, it can be done by setting spark.yarn.stagingDir.
>
> But in production, we have spark 1.6. Can anyone please suggest how it can
> be done in spark 1.6.
>
> --
> Thanks and Regards,
>
> Saurav Sinha
>
> Contact: 9742879062
>
-- 

Best Regards,


Vikash Pareek
Team Lead  *InfoObjects Inc.*
Big Data Analytics

m: +91 8800206898 a: E5, Jhalana Institutionall Area, Jaipur, Rajasthan
302004
w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com


Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal
Probably Cassandra is a good choice if you are mainly looking for a
datastore that supports fast writes. You can ingest the data into a table
and define one or more materialized views on top of it to support your
queries. Since you mention that your queries are going to be simple you can
define your indexes in the materialized views according to how you want to
query the data.

Thanks,
Shiva



On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar  wrote:

> Hello Vincent,
>
> Cassandra may not fit my bill if I need to define my partition and other
> indexes upfront. Is this right?
>
> Hello Richard,
>
> Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
> then the connector to Apache Spark did not support Spark 2.0.
>
> Another drastic thought may be to repartition the result count to 1 (but I
> have to be cautious about making sure I don't run into heap issues if the
> result is too large to fit into an executor) and write to a relational
> database like mysql / postgres. But, I believe I can do the same using
> ElasticSearch too.
> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>
> More thoughts welcome please.
>
> Thanks,
> Muthu
>
> On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling 
> wrote:
>
>> maybe Apache Ignite does fit your requirements
>>
>> On 15 March 2017 at 08:44, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> Hi
>>> If queries are statics and filters are on the same columns, Cassandra is
>>> a good option.
>>>
>>> On 15 Mar 2017 7:04 AM, "muthu"  wrote:
>>>
>>> Hello there,
>>>
>>> I have one or more parquet files to read and perform some aggregate
>>> queries
>>> using Spark Dataframe. I would like to find a reasonable fast datastore
>>> that
>>> allows me to write the results for subsequent (simpler queries).
>>> I did attempt to use ElasticSearch to write the query results using
>>> ElasticSearch Hadoop connector. But I am running into connector write
>>> issues
>>> if the number of Spark executors are too many for ElasticSearch to
>>> handle.
>>> But in the schema sense, this seems a great fit as ElasticSearch has
>>> smartz
>>> in place to discover the schema. Also in the query sense, I can perform
>>> simple filters and sort using ElasticSearch and for more complex
>>> aggregate,
>>> Spark Dataframe can come back to the rescue :).
>>> Please advice on other possible data-stores I could use?
>>>
>>> Thanks,
>>> Muthu
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>
>


Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
Hello Vincent,

Cassandra may not fit my bill if I need to define my partition and other
indexes upfront. Is this right?

Hello Richard,

Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
then the connector to Apache Spark did not support Spark 2.0.

Another drastic thought may be to repartition the result count to 1 (but I have
to be cautious about making sure I don't run into heap issues if the result is
too large to fit into an executor) and write to a relational database like
mysql / postgres. But, I believe I can do the same using ElasticSearch too.
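A rough sketch of that coalesce-and-write-to-RDBMS idea (the connection details,
table name and "resultDf" are placeholders):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

resultDf
  .coalesce(1)        // single partition -> single JDBC writer; watch executor memory
  .write
  .mode("overwrite")
  .jdbc("jdbc:postgresql://dbhost:5432/results", "query_results", props)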

A slightly over-kill solution may be Spark to Kafka to ElasticSearch?

More thoughts welcome please.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling 
wrote:

> maybe Apache Ignite does fit your requirements
>
> On 15 March 2017 at 08:44, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Hi
>> If queries are statics and filters are on the same columns, Cassandra is
>> a good option.
>>
>> On 15 Mar 2017 7:04 AM, "muthu"  wrote:
>>
>> Hello there,
>>
>> I have one or more parquet files to read and perform some aggregate
>> queries
>> using Spark Dataframe. I would like to find a reasonable fast datastore
>> that
>> allows me to write the results for subsequent (simpler queries).
>> I did attempt to use ElasticSearch to write the query results using
>> ElasticSearch Hadoop connector. But I am running into connector write
>> issues
>> if the number of Spark executors are too many for ElasticSearch to handle.
>> But in the schema sense, this seems a great fit as ElasticSearch has
>> smartz
>> in place to discover the schema. Also in the query sense, I can perform
>> simple filters and sort using ElasticSearch and for more complex
>> aggregate,
>> Spark Dataframe can come back to the rescue :).
>> Please advice on other possible data-stores I could use?
>>
>> Thanks,
>> Muthu
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Thrift Server as JDBC endpoint

2017-03-15 Thread Sebastian Piu
Hi all,

I'm doing some research on best ways to expose data created by some of our
spark jobs so that they can be consumed by a client (A Web UI).

The data we need to serve might be huge but we can control the type of
queries that are submitted e.g.:
* Limit number of results
* only accept SELECT statements (i.e. readonly)
* Only expose some pre-calculated datasets, as in, always going to a
particular partition - no joins etc.

In terms of latency, the lower the better, but we don't have any weird
scenarios like sub-second responses, and stability is hugely preferred.

Is thrift server stable for this kind of use cases? How does it perform
under concurrency? Is it better to have several instances and load balance
them or a single one with more resources?

Would be interested in hearing any experiences from people using this on
prod environments

thanks
Seb


Re: Fast write datastore...

2017-03-15 Thread Richard Siebeling
maybe Apache Ignite does fit your requirements

On 15 March 2017 at 08:44, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Hi
> If queries are statics and filters are on the same columns, Cassandra is a
> good option.
>
> On 15 Mar 2017 7:04 AM, "muthu"  wrote:
>
> Hello there,
>
> I have one or more parquet files to read and perform some aggregate queries
> using Spark Dataframe. I would like to find a reasonable fast datastore
> that
> allows me to write the results for subsequent (simpler queries).
> I did attempt to use ElasticSearch to write the query results using
> ElasticSearch Hadoop connector. But I am running into connector write
> issues
> if the number of Spark executors are too many for ElasticSearch to handle.
> But in the schema sense, this seems a great fit as ElasticSearch has smartz
> in place to discover the schema. Also in the query sense, I can perform
> simple filters and sort using ElasticSearch and for more complex aggregate,
> Spark Dataframe can come back to the rescue :).
> Please advice on other possible data-stores I could use?
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Saurav Sinha
Hi Users,


I am running spark job in yarn.

I want to set staging directory to some other location which is by default
hdfs://host:port/home/$User/

In spark 2.0.0, it can be done by setting spark.yarn.stagingDir.

But in production, we have spark 1.6. Can anyone please suggest how it can
be done in spark 1.6.

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Fast write datastore...

2017-03-15 Thread vincent gromakowski
Hi
If queries are static and filters are on the same columns, Cassandra is a
good option.

On 15 Mar 2017 7:04 AM, "muthu"  wrote:

Hello there,

I have one or more parquet files to read and perform some aggregate queries
using Spark Dataframe. I would like to find a reasonable fast datastore that
allows me to write the results for subsequent (simpler queries).
I did attempt to use ElasticSearch to write the query results using
ElasticSearch Hadoop connector. But I am running into connector write issues
if the number of Spark executors are too many for ElasticSearch to handle.
But in the schema sense, this seems a great fit as ElasticSearch has smartz
in place to discover the schema. Also in the query sense, I can perform
simple filters and sort using ElasticSearch and for more complex aggregate,
Spark Dataframe can come back to the rescue :).
Please advise on other possible data-stores I could use?

Thanks,
Muthu



--
View this message in context: http://apache-spark-user-list.
1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Scaling Kafka Direct Streming application

2017-03-15 Thread vincent gromakowski
You would probably need dynamic allocation, which is only available on YARN
and Mesos. Or wait for the ongoing Spark k8s integration.
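A hedged illustration of the dynamic allocation settings being referred to
(assumes YARN or Mesos with the external shuffle service enabled; the values
are examples only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "10")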


On 15 Mar 2017 1:54 AM, "Pranav Shukla"  wrote:

> How to scale or possibly auto-scale a spark streaming application
> consuming from kafka and using kafka direct streams. We are using spark
> 1.6.3, cannot move to 2.x unless there is a strong reason.
>
> Scenario:
> Kafka topic with 10 partitions
> Standalone cluster running on kubernetes with 1 master and 2 workers
>
> What we would like to know?
> Increase the number of partitions (say from 10 to 15)
> Add additional worker node without restarting the streaming application
> and start consuming off the additional partitions.
>
> Is this possible? i.e. start additional workers in standalone cluster to
> auto-scale an existing spark streaming application that is already running
> or we have to stop and resubmit the streaming app?
>
> Best Regards,
> Pranav Shukla
>


Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Hongdi Ren
Since N is decided at runtime, the first idea that comes to my mind is to transform
the columns into one vector column (VectorAssembler can do that) and then let the udf
handle the vector, just like many ml transformers do.
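A minimal sketch of that idea, using VectorAssembler to build the vector column
(the column names and the sum logic are illustrative only; "df" is an assumed
DataFrame with numeric columns):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))   // the N user-selected columns
  .setOutputCol("features")

val assembled = assembler.transform(df)

val vecSum = udf((v: Vector) => v.toArray.sum)   // one UDF over the assembled vector
val result = assembled.withColumn("total", vecSum(assembled("features")))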

 

From: anup ahire 
Date: Wednesday, March 15, 2017 at 2:04 PM
To: 
Subject: apply UDFs to N columns dynamically in dataframe

 

Hello,

 

I have a schema and name of columns to apply UDF to. Name of columns are user 
input and they can differ in numbers for each input.

 

Is there a way to apply UDFs to N columns in dataframe  ?

 

 

Thanks !



Spark SQL Skip and Log bad records

2017-03-15 Thread Aviral Agarwal
Hi guys,

Is there a way to skip some bad records and log them when using DataFrame
API ?


Thanks and Regards,
Aviral Agarwal


Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
Hi Julian,

Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.

Right now the seed is set to this.getClass.getName.hashCode.toLong by
default, which indeed stays the same across multiple fits. Feel free to
leave your comments or send a PR for the fix.

For your problem, you may add .setSeed(new Random().nextLong()) before
fit() as a workaround.
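A minimal sketch of that workaround (k, the init mode and "dataset" are
illustrative; dataset is assumed to have a "features" column):

import scala.util.Random
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(5)
  .setInitMode("random")
  .setSeed(Random.nextLong())   // overrides the class-name-based default seed

val model = kmeans.fit(dataset)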

Thanks,
Yuhao

2017-03-14 5:46 GMT-07:00 Julian Keppel :

> I'm sorry, I missed some important informations. I use Spark version 2.0.2
> in Scala 2.11.8.
>
> 2017-03-14 13:44 GMT+01:00 Julian Keppel :
>
>> Hi everybody,
>>
>> I make some experiments with the Spark kmeans implementation of the new
>> DataFrame-API. I compare clustering results of different runs with
>> different parameters. I recognized that for random initialization mode, the
>> seed value is the same every time. How is it calculated? In my
>> understanding the seed should be random if it is not provided by the user.
>>
>> Thank you for you help.
>>
>> Julian
>>
>
>


apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread anup ahire
Hello,

I have a schema and name of columns to apply UDF to. Name of columns are
user input and they can differ in numbers for each input.

Is there a way to apply UDFs to N columns in dataframe  ?


Thanks !


Fast write datastore...

2017-03-15 Thread muthu
Hello there,

I have one or more parquet files to read and perform some aggregate queries
using Spark Dataframe. I would like to find a reasonably fast datastore that
allows me to write the results for subsequent (simpler) queries.
I did attempt to use ElasticSearch to write the query results using
ElasticSearch Hadoop connector. But I am running into connector write issues
if the number of Spark executors are too many for ElasticSearch to handle.
But in the schema sense, this seems a great fit as ElasticSearch has smartz
in place to discover the schema. Also in the query sense, I can perform
simple filters and sort using ElasticSearch and for more complex aggregate,
Spark Dataframe can come back to the rescue :).
Please advise on other possible data-stores I could use?

Thanks,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org