Re: Failed to connect to master ...

2017-03-07 Thread Shixiong(Ryan) Zhu
The Spark master may bind to a different address. Take a look at this page
to find the correct URL: http://VM_IPAddress:8080/
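
For example (Scala shown, host name illustrative), the builder call would use the exact URL reported on that page:

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("Java Spark SQL")
  .master("spark://host-shown-on-web-ui:7077")  // copy this verbatim from the master web UI header
  .getOrCreate()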

On Tue, Mar 7, 2017 at 10:13 PM, Mina Aslani  wrote:

> Master and worker processes are running!
>
> On Wed, Mar 8, 2017 at 12:38 AM, ayan guha  wrote:
>
>> You need to start Master and worker processes before connecting to them.
>>
>> On Wed, Mar 8, 2017 at 3:33 PM, Mina Aslani  wrote:
>>
>>> Hi,
>>>
>>> I am writing a Spark Transformer in IntelliJ in Java and trying to
>>> connect to Spark in a VM using setMaster. I get "Failed to connect to
>>> master ..."
>>>
>>> I get 17/03/07 16:20:55 WARN StandaloneAppClient$ClientEndpoint: Failed
>>> to connect to master VM_IPAddress:7077
>>> org.apache.spark.SparkException: Exception thrown in awaitResult
>>> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>>> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>>
>>> SparkSession spark = SparkSession
>>>   .builder()
>>>   .appName("Java Spark SQL")
>>>   //.master("local[1]")
>>>   .master("spark://VM_IPAddress:7077")
>>>   .getOrCreate();
>>>
>>> Dataset<String> lines = spark
>>>   .readStream()
>>>   .format("kafka")
>>>   .option("kafka.bootstrap.servers", brokers)
>>>   .option("subscribe", topic)
>>>   .load()
>>>   .selectExpr("CAST(value AS STRING)")
>>>   .as(Encoders.STRING());
>>>
>>>
>>>
>>> I get the same error when I try master("spark://spark-master:7077").
>>>
>>> However, with .master("local[1]") no exception is thrown.
>>>
>>> My Kafka is in the same VM and, being new to Spark, I am still trying to
>>> understand:
>>>
>>> - Why do I get the above exception and how can I fix it (connect to Spark
>>> in the VM and read from Kafka in the VM)?
>>>
>>> - Why is no exception thrown when using "local[1]", and how do I set up
>>> reading from Kafka in the VM?
>>>
>>> - How do I stream from Kafka (the data in the topic is in JSON format)?
>>> Your input is appreciated!
>>>
>>> Best regards,
>>> Mina
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Failed to connect to master ...

2017-03-07 Thread Mina Aslani
Master and worker processes are running!

On Wed, Mar 8, 2017 at 12:38 AM, ayan guha  wrote:

> You need to start Master and worker processes before connecting to them.
>
> On Wed, Mar 8, 2017 at 3:33 PM, Mina Aslani  wrote:
>
>> Hi,
>>
>> I am writing a Spark Transformer in IntelliJ in Java and trying to
>> connect to Spark in a VM using setMaster. I get "Failed to connect to
>> master ..."
>>
>> I get 17/03/07 16:20:55 WARN StandaloneAppClient$ClientEndpoint: Failed
>> to connect to master VM_IPAddress:7077
>> org.apache.spark.SparkException: Exception thrown in awaitResult
>> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>
>> SparkSession spark = SparkSession
>>   .builder()
>>   .appName("Java Spark SQL")
>>   //.master("local[1]")
>>   .master("spark://VM_IPAddress:7077")
>>   .getOrCreate();
>>
>> Dataset<String> lines = spark
>>   .readStream()
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", brokers)
>>   .option("subscribe", topic)
>>   .load()
>>   .selectExpr("CAST(value AS STRING)")
>>   .as(Encoders.STRING());
>>
>>
>>
>> I get the same error when I try master("spark://spark-master:7077").
>>
>> However, with .master("local[1]") no exception is thrown.
>>
>> My Kafka is in the same VM and, being new to Spark, I am still trying to understand:
>>
>> - Why do I get the above exception and how can I fix it (connect to Spark in the VM and
>> read from Kafka in the VM)?
>>
>> - Why is no exception thrown when using "local[1]", and how do I set up reading from
>> Kafka in the VM?
>>
>> - How do I stream from Kafka (the data in the topic is in JSON format)?
>> Your input is appreciated!
>>
>> Best regards,
>> Mina
>>
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Failed to connect to master ...

2017-03-07 Thread ayan guha
You need to start Master and worker processes before connecting to them.

On Wed, Mar 8, 2017 at 3:33 PM, Mina Aslani  wrote:

> Hi,
>
> I am writing a Spark Transformer in IntelliJ in Java and trying to connect
> to Spark in a VM using setMaster. I get "Failed to connect to master
> ..."
>
> I get 17/03/07 16:20:55 WARN StandaloneAppClient$ClientEndpoint: Failed
> to connect to master VM_IPAddress:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>
> SparkSession spark = SparkSession
>   .builder()
>   .appName("Java Spark SQL")
>   //.master("local[1]")
>   .master("spark://VM_IPAddress:7077")
>   .getOrCreate();
>
> Dataset<String> lines = spark
>   .readStream()
>   .format("kafka")
>   .option("kafka.bootstrap.servers", brokers)
>   .option("subscribe", topic)
>   .load()
>   .selectExpr("CAST(value AS STRING)")
>   .as(Encoders.STRING());
>
>
>
> I get the same error when I try master("spark://spark-master:7077").
>
> However, with .master("local[1]") no exception is thrown.
>
> My Kafka is in the same VM and, being new to Spark, I am still trying to understand:
>
> - Why do I get the above exception and how can I fix it (connect to Spark in the VM and
> read from Kafka in the VM)?
>
> - Why is no exception thrown when using "local[1]", and how do I set up reading from
> Kafka in the VM?
>
> - How do I stream from Kafka (the data in the topic is in JSON format)?
> Your input is appreciated!
>
> Best regards,
> Mina
>
>
>
>


-- 
Best Regards,
Ayan Guha


Failed to connect to master ...

2017-03-07 Thread Mina Aslani
Hi,

I am writing a Spark Transformer in IntelliJ in Java and trying to connect
to Spark in a VM using setMaster. I get "Failed to connect to master
..."

I get 17/03/07 16:20:55 WARN StandaloneAppClient$ClientEndpoint: Failed to
connect to master VM_IPAddress:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL")
  //.master("local[1]")
  .master("spark://VM_IPAddress:7077")
  .getOrCreate();

Dataset<String> lines = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as(Encoders.STRING());



I get the same error when I try master("spark://spark-master:7077").

However, with .master("local[1]") no exception is thrown.

My Kafka is in the same VM and, being new to Spark, I am still trying to understand:

- Why do I get the above exception and how can I fix it (connect to Spark in
the VM and read from Kafka in the VM)?

- Why is no exception thrown when using "local[1]", and how do I set up reading
from Kafka in the VM?

- How do I stream from Kafka (the data in the topic is in JSON format)? Something
along the lines of the sketch below is what I am aiming for.
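
Roughly, what I am aiming for is something like this sketch (shown in Scala,
assuming Spark 2.1+; the schema fields are made up):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical schema for the JSON payloads in the topic
val schema = new StructType().add("id", LongType).add("name", StringType)

val parsed = lines
  .select(from_json(col("value"), schema).as("data"))  // parse the Kafka value column
  .select("data.*")                                    // flatten the parsed struct into columns
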
Your input is appreciated!

Best regards,
Mina


Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread cht liu
Do you have Spark's fault tolerance mechanism (RDD checkpointing) enabled? At the
end of the job it will start a separate job that writes the checkpoint data to the
file system so it is persisted for high availability.
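
For reference, this is roughly what that looks like (only a sketch; the checkpoint
directory and the RDD are illustrative). The checkpoint itself is materialized by a
separate job that runs after the action, which is the extra work I am referring to:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // illustrative checkpoint location
val partitioned = datasetA.rdd                   // datasetA as in the quoted question
partitioned.checkpoint()                         // mark the RDD for checkpointing
partitioned.count()                              // action; a separate job then writes the checkpoint data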

2017-03-08 2:45 GMT+08:00 Swapnil Shinde :

> Hello all
>I have a spark job that reads parquet data and partition it based on
> one of the columns. I made sure partitions equally distributed and not
> skewed. My code looks like this -
>
> datasetA.write.partitionBy("column1").parquet(outputPath)
>
> Execution plan -
> [image: Inline image 1]
>
> All tasks (~12,000) finish in 30-35 mins but it takes another 40-45 mins
> to close the application. I am not sure what Spark is doing after all tasks are
> processed successfully.
> I checked the thread dump (using the UI executor tab) on a few executors but couldn't
> find anything major. Overall, a few shuffle-client processes are "RUNNABLE"
> and a few dispatched-* processes are "WAITING".
>
> Please let me know what spark is doing at this stage(after all tasks
> finished) and any way I can optimize it.
>
> Thanks
> Swapnil
>
>
>


made spark job to throw exception still going under finished succeeded status in yarn

2017-03-07 Thread nancy henry
Hi Team,

I wrote the code below to throw an exception. How can I make it throw the
exception and move the job to FAILED status in YARN under some condition,
while still closing the SparkContext and releasing resources?


object Demo {
  def main(args: Array[String]) = {

    var a = 0; var c = 0;

    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)

    val hiveSqlContext: HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    for (a <- 0 to args.length - 1) {
      val query = sc.textFile(args(a)).collect.filter(query =>
        !query.contains("--")).mkString(" ")
      var queryarray = query.split(";")
      var b = query.split(";").length

      var querystatuscheck = true;
      for (c <- 0 to b - 1) {

        if (querystatuscheck) {

          if (!(StringUtils.isBlank(queryarray(c)))) {

            val querystatus = Try { hiveSqlContext.sql(queryarray(c)) }

            var b = c + 1
            querystatuscheck = querystatus.isSuccess
            System.out.println("Your" + b + "query status is " + querystatus)
            System.out.println("querystatuschecktostring is" + querystatuscheck.toString())
            querystatuscheck.toString() match {
              case "false" => {
                throw querystatus.failed.get
                System.out.println("case true executed")
                sc.stop()
              }
              case _ => {
                sc.stop()
                System.out.println("case default executed")
              }
            }

          }

        }
      }

      System.out.println("Okay")

    }

  }

}
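
One shape I am considering (only a sketch, not tested; it assumes the driver runs
inside the YARN application, e.g. cluster deploy mode) is to stop the context in a
finally block and rethrow, so the application still ends up FAILED:

import scala.util.{Failure, Success, Try}

try {
  for (file <- args) {
    val queries = sc.textFile(file).collect
      .filter(line => !line.contains("--")).mkString(" ").split(";")
    for (q <- queries if !StringUtils.isBlank(q)) {
      Try(hiveSqlContext.sql(q)) match {
        case Failure(e) => throw e   // propagate the first failing query
        case Success(_) =>           // continue with the next query
      }
    }
  }
} finally {
  sc.stop()                          // always release resources; the rethrown exception still fails the application
}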


PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-07 Thread Yeoul Na

Hi all,

I am trying to analyze PySpark performance overhead. People just say PySpark
is slower than Scala due to the Serialization/Deserialization overhead. I
tried with the example in this post:
https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many
articles say straight-forward Python implementation is the slowest due to
the serialization/deserialization overhead.

However, when I actually looked at the log in the Web UI, the serialization and
deserialization times of PySpark do not seem to be any bigger than those of
Scala. The main contributor was "Executor Computing Time". Thus, we cannot be
sure whether this is due to serialization or because Python code is
basically slower than Scala code.

So my question is: does "Task Deserialization Time" in the Spark Web UI
actually include serialization/deserialization times in PySpark? If this is
not the case, how can I actually measure the serialization/deserialization
overhead?

Thanks,
Yeoul




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: finding Spark Master

2017-03-07 Thread Yong Zhang
This website explains it very clear, if you are using Yarn.


https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_running_spark_on_yarn.html


From: Adaryl Wakefield 
Sent: Tuesday, March 7, 2017 8:53 PM
To: Koert Kuipers
Cc: user@spark.apache.org
Subject: RE: finding Spark Master


Ah so I see setMaster(‘yarn-client’). Hmm.



What I was ultimately trying to do was develop with Eclipse on my windows box 
and have the code point to my cluster so it executes there instead of my local 
windows machine. Perhaps I’m going about this wrong.



Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685

www.massstreet.net

www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData



From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Tuesday, March 7, 2017 7:47 PM
To: Adaryl Wakefield 
Cc: user@spark.apache.org
Subject: Re: finding Spark Master



assuming this is running on yarn there is really no spark master. every job
creates its own "master" within a yarn application.



On Tue, Mar 7, 2017 at 6:27 PM, Adaryl Wakefield 
> wrote:

I’m running a three node cluster along with Spark along with Hadoop as part of 
a HDP stack. How do I find my Spark Master? I’m just seeing the clients. I’m 
trying to figure out what goes in setMaster() aside from local[*].



Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685

www.massstreet.net

www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData






RE: finding Spark Master

2017-03-07 Thread Adaryl Wakefield
Ah so I see setMaster(‘yarn-client’). Hmm.

What I was ultimately trying to do was develop with Eclipse on my windows box 
and have the code point to my cluster so it executes there instead of my local 
windows machine. Perhaps I’m going about this wrong.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Tuesday, March 7, 2017 7:47 PM
To: Adaryl Wakefield 
Cc: user@spark.apache.org
Subject: Re: finding Spark Master

assuming this is running on yarn there is really no spark master. every job
creates its own "master" within a yarn application.

On Tue, Mar 7, 2017 at 6:27 PM, Adaryl Wakefield 
> wrote:
I’m running a three node cluster along with Spark along with Hadoop as part of 
a HDP stack. How do I find my Spark Master? I’m just seeing the clients. I’m 
trying to figure out what goes in setMaster() aside from local[*].

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: finding Spark Master

2017-03-07 Thread Koert Kuipers
assuming this is running on yarn there is really no spark master. every job
creates its own "master" within a yarn application.
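
for what it's worth, a minimal sketch of what pointing at yarn looks like (it
assumes HADOOP_CONF_DIR / YARN_CONF_DIR on the machine running the code point at
the cluster's configuration, and that the spark yarn support is on the classpath):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("yarn example")
  .master("yarn")   // client deploy mode by default; the per-application "master" lives inside the YARN app
  .getOrCreate()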

On Tue, Mar 7, 2017 at 6:27 PM, Adaryl Wakefield <
adaryl.wakefi...@hotmail.com> wrote:

> I’m running a three node cluster along with Spark along with Hadoop as
> part of a HDP stack. How do I find my Spark Master? I’m just seeing the
> clients. I’m trying to figure out what goes in setMaster() aside from
> local[*].
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685 <(913)%20938-6685>
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>


Spark job stopping abruptly

2017-03-07 Thread Divya Gehlot
Hi,
I have a Spark standalone cluster on AWS EC2 and recently my Spark streaming
jobs have been stopping abruptly.
When I check the logs I found this

17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote.
Handshake timed out or transport failure detector triggered.
17/03/07 06:09:39 ERROR WorkerWatcher: Lost connection to worker rpc
endpoint akka.tcp://sparkwor...@xxx.xx.xx.xxx:44271/user/Worker.
Exiting.
17/03/07 06:09:39 WARN CoarseGrainedExecutorBackend: An unknown
(xxx.xx.x.xxx:44271) driver disconnected.

17/03/07 06:09:39 INFO DiskBlockManager: Shutdown hook called

I suspect it is due to the network connection between the AWS instances of the
Spark cluster.

Could somebody shed more light on it?


Thanks
Divya


RE: finding Spark Master

2017-03-07 Thread Adaryl Wakefield
I’m sorry I don’t understand. Is that a question or the answer?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, March 7, 2017 5:59 PM
To: Adaryl Wakefield ; user@spark.apache.org
Subject: Re: finding Spark Master

 yarn-client or yarn-cluster

On Wed, 8 Mar 2017 at 10:28 am, Adaryl Wakefield 
> wrote:
I’m running a three node cluster along with Spark along with Hadoop as part of 
a HDP stack. How do I find my Spark Master? I’m just seeing the clients. I’m 
trying to figure out what goes in setMaster() aside from local[*].

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

--
Best Regards,
Ayan Guha


Re: finding Spark Master

2017-03-07 Thread ayan guha
 yarn-client or yarn-cluster

On Wed, 8 Mar 2017 at 10:28 am, Adaryl Wakefield <
adaryl.wakefi...@hotmail.com> wrote:

> I’m running a three node cluster along with Spark along with Hadoop as
> part of a HDP stack. How do I find my Spark Master? I’m just seeing the
> clients. I’m trying to figure out what goes in setMaster() aside from
> local[*].
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>
-- 
Best Regards,
Ayan Guha


finding Spark Master

2017-03-07 Thread Adaryl Wakefield
I'm running a three node cluster along with Spark along with Hadoop as part of 
a HDP stack. How do I find my Spark Master? I'm just seeing the clients. I'm 
trying to figure out what goes in setMaster() aside from local[*].

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData



Re: How to unit test spark streaming?

2017-03-07 Thread kant kodali
Agreed with the statement in quotes below: whether one wants to do unit
tests or not, it is good practice to write code that way. But I think the
more painful and tedious task is to mock/emulate all the nodes such as
Spark workers/master/HDFS/the input source stream and all that. I wish there were
something really simple. Perhaps the simplest thing to do is just to write
integration tests, which also test the transformations/business logic. This
way I can spawn a small cluster, run my tests, and bring my cluster down
when I am done. And sure, if the cluster isn't available then I can't run
the tests, but some node should be available even to run a single
process. I somehow feel like we may be doing too much work to fit into the
archaic definition of unit tests.

 "Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df " this

On Tue, Mar 7, 2017 at 11:14 AM, Michael Armbrust 
wrote:

> Basically you abstract your transformations to take in a dataframe and
>> return one, then you assert on the returned df
>>
>
> +1 to this suggestion.  This is why we wanted streaming and batch
> dataframes to share the same API.
>


Re: Structured Streaming - Kafka

2017-03-07 Thread Bowden, Chris
https://issues.apache.org/jira/browse/SPARK-19853, pr by eow


From: Shixiong(Ryan) Zhu 
Sent: Tuesday, March 7, 2017 2:04:45 PM
To: Bowden, Chris
Cc: user; Gudenkauf, Jack
Subject: Re: Structured Streaming - Kafka

Good catch. Could you create a ticket? You can also submit a PR to fix it if 
you have time :)

On Tue, Mar 7, 2017 at 1:52 PM, Bowden, Chris 
> wrote:

Potential bug when using startingOffsets = SpecificOffsets with Kafka topics 
containing uppercase characters?

KafkaSourceProvider#L80/86:

val startingOffsets =
  
caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) 
match {
case Some("latest") => LatestOffsets
case Some("earliest") => EarliestOffsets
case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
case None => LatestOffsets
  }

Topics in JSON get lowered so underlying assignments in the consumer are 
incorrect, and the assertion in KafkaSource#L326 triggers:

private def fetchSpecificStartingOffsets(
partitionOffsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] = {
  val result = withRetriesWithoutInterrupt {
// Poll to get the latest assigned partitions
consumer.poll(0)
val partitions = consumer.assignment()
consumer.pause(partitions)
assert(partitions.asScala == partitionOffsets.keySet,
  "If startingOffsets contains specific offsets, you must specify all 
TopicPartitions.\n" +
"Use -1 for latest, -2 for earliest, if you don't care.\n" +
s"Specified: ${partitionOffsets.keySet} Assigned: 
${partitions.asScala}")



Re: Structured Streaming - Kafka

2017-03-07 Thread Shixiong(Ryan) Zhu
Good catch. Could you create a ticket? You can also submit a PR to fix it
if you have time :)

On Tue, Mar 7, 2017 at 1:52 PM, Bowden, Chris  wrote:

> Potential bug when using startingOffsets = SpecificOffsets with Kafka
> topics containing uppercase characters?
>
> KafkaSourceProvider#L80/86:
>
> val startingOffsets =
>   
> caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase)
>  match {
> case Some("latest") => LatestOffsets
> case Some("earliest") => EarliestOffsets
> case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
> case None => LatestOffsets
>   }
>
> Topics in JSON get lowered so underlying assignments in the consumer are
> incorrect, and the assertion in KafkaSource#L326 triggers:
>
> private def fetchSpecificStartingOffsets(
> partitionOffsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] = 
> {
>   val result = withRetriesWithoutInterrupt {
> // Poll to get the latest assigned partitions
> consumer.poll(0)
> val partitions = consumer.assignment()
> consumer.pause(partitions)
> assert(partitions.asScala == partitionOffsets.keySet,
>   "If startingOffsets contains specific offsets, you must specify all 
> TopicPartitions.\n" +
> "Use -1 for latest, -2 for earliest, if you don't care.\n" +
> s"Specified: ${partitionOffsets.keySet} Assigned: 
> ${partitions.asScala}")
>
>


Structured Streaming - Kafka

2017-03-07 Thread Bowden, Chris
Potential bug when using startingOffsets = SpecificOffsets with Kafka topics 
containing uppercase characters?

KafkaSourceProvider#L80/86:

val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
    case Some("latest") => LatestOffsets
    case Some("earliest") => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None => LatestOffsets
  }

Topics in JSON get lowered so underlying assignments in the consumer are
incorrect, and the assertion in KafkaSource#L326 triggers:

private def fetchSpecificStartingOffsets(
    partitionOffsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] = {
  val result = withRetriesWithoutInterrupt {
    // Poll to get the latest assigned partitions
    consumer.poll(0)
    val partitions = consumer.assignment()
    consumer.pause(partitions)
    assert(partitions.asScala == partitionOffsets.keySet,
      "If startingOffsets contains specific offsets, you must specify all TopicPartitions.\n" +
        "Use -1 for latest, -2 for earliest, if you don't care.\n" +
        s"Specified: ${partitionOffsets.keySet} Assigned: ${partitions.asScala}")
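
A sketch of one possible fix (not necessarily how it should land upstream):
compare only the literal keywords case-insensitively and leave the raw JSON
untouched so topic names keep their case:

val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
    case Some(offsets) if offsets.toLowerCase == "latest" => LatestOffsets
    case Some(offsets) if offsets.toLowerCase == "earliest" => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))   // JSON left as-is, topic case preserved
    case None => LatestOffsets
  }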


Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users,

I am migrating PySpark code to Scala. With Python, iterating over Spark data
with a dictionary and generating JSON with nulls is possible with
json.dumps(), which is then converted to a SparkSQL Row. In Scala, how can
we generate JSON with null values as a DataFrame?

Thanks.


Does anybody use spark.rpc.io.mode=epoll?

2017-03-07 Thread Steven Ruppert
The epoll mode definitely exists in spark, but the official
documentation does not mention it, nor any of the other settings that
appear to be unofficially documented in:

https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rpc-netty.adoc

I don't seem to have any particular performance problems with the
default NIO impl, but the "lower gc pressure" mentioned in the
official netty docs
https://github.com/netty/netty/wiki/Native-transports does seem
attractive.

However, the fact that it's not even documented gives me pause. Is it
deprecated, or perhaps just not useful? Do I have to stick the native
library jar into the spark classpath to use it?

-- 
*CONFIDENTIALITY NOTICE: This email message, and any documents, files or 
previous e-mail messages attached to it is for the sole use of the intended 
recipient(s) and may contain confidential and privileged information. Any 
unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email 
and destroy all copies of the original message.*

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to unit test spark streaming?

2017-03-07 Thread Michael Armbrust
>
> Basically you abstract your transformations to take in a dataframe and
> return one, then you assert on the returned df
>

+1 to this suggestion.  This is why we wanted streaming and batch
dataframes to share the same API.


Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas
I was kind of hoping that I could use Spark in this instance to generate
that intermediate SQL as part of its workflow strategy, sort of as a
database-independent way of doing my preprocessing.
Is there any way to capture the SQL generated by Catalyst?
If so, I would just use JdbcRDD with that.

The other option is to generate that SQL as text myself, which isn't the
nicest thing to do.
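
In the meantime, the workaround I am considering is handing the JDBC reader a
subquery instead of a bare table name, so the recoding runs inside the database
(only a sketch; spark is the SparkSession, and the URL, table name and CASE
expression are placeholders):

val pushedDown = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)                        // placeholder connection URL
  .option("dbtable",
    """(SELECT CASE WHEN columnName = 'foobar'    THEN 0
      |            WHEN columnName = 'foobarbaz' THEN 1
      |       END AS columnName
      |  FROM myTable) AS t""".stripMargin)      // the database evaluates the CASE, not the cluster
  .load()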

On Mar 7, 2017 5:02 PM, "Subhash Sriram"  wrote:

> Could you create a view of the table on your JDBC data source and just
> query that from Spark?
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas 
> wrote:
> >
> > As an example, this is basically what I'm doing:
> >
> >  val myDF = 
> > originalDataFrame.select(col(columnName).when(col(columnName)
> === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
> >
> > Except there's much more columns and much more conditionals. The
> generated Spark workflow starts with an SQL that basically does:
> >
> >SELECT columnName, columnName2, etc. from table;
> >
> > Then the conditionals/transformations are evaluated on the cluster.
> >
> > Is there a way from the DataSet API to force the computation to happen
> on the SQL data source in this case? Or should I work with JDBCRDD and use
> createDataFrame on that?
> >
> >
> >> On 03/07/2017 02:19 PM, Jörn Franke wrote:
> >> Can you provide some source code? I am not sure I understood the
> problem .
> >> If you want to do a preprocessing at the JDBC datasource then you can
> write your own data source. Additionally you may want to modify the sql
> statement to extract the data in the right format and push some
> preprocessing to the database.
> >>
> >>> On 7 Mar 2017, at 12:04, El-Hassan Wanas 
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> There is, as usual, a big table lying on some JDBC data source. I am
> doing some data processing on that data from Spark, however, in order to
> speed up my analysis, I use reduced encodings and minimize the general size
> of the data before processing.
> >>>
> >>> Spark has been doing a great job at generating the proper workflows
> that do that preprocessing for me, but it seems to generate those workflows
> for execution on the Spark Cluster. The issue with that is the large
> transfer cost is still incurred.
> >>>
> >>> Is there any way to force Spark to run the preprocessing on the JDBC
> data source and get the prepared output DataFrame instead?
> >>>
> >>> Thanks,
> >>>
> >>> Wanas
> >>>
> >>>
> >>> -
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>>
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>


Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread Swapnil Shinde
Hello all
   I have a Spark job that reads parquet data and partitions it based on one
of the columns. I made sure the partitions are equally distributed and not skewed.
My code looks like this -

datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan -
[image: Inline image 1]

All tasks (~12,000) finish in 30-35 mins but it takes another 40-45 mins
to close the application. I am not sure what Spark is doing after all tasks are
processed successfully.
I checked the thread dump (using the UI executor tab) on a few executors but couldn't
find anything major. Overall, a few shuffle-client processes are "RUNNABLE"
and a few dispatched-* processes are "WAITING".

Please let me know what spark is doing at this stage(after all tasks
finished) and any way I can optimize it.

Thanks
Swapnil


RE: using spark to load a data warehouse in real time

2017-03-07 Thread Adaryl Wakefield
Hi Henry,
I didn’t catch your email until now. When you wrote to the database, how did 
you enforce the schema? Did the data frames just spit everything out with the 
necessary keys?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Henry Tremblay [mailto:paulhtremb...@gmail.com]
Sent: Tuesday, February 28, 2017 3:56 PM
To: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time


We did this all the time at my last position.

1. We had unstructured data in S3.

2.We read directly from S3 and then gave structure to the data by a dataframe 
in Spark.

3. We wrote the results to S3

4. We used Redshift's super fast parallel ability to load the results into a 
table.

Henry

On 02/28/2017 11:04 AM, Mohammad Tariq wrote:
You could try this as a blueprint :

Read the data in through Spark Streaming. Iterate over it and convert each RDD 
into a DataFrame. Use these DataFrames to perform whatever processing is 
required and then save that DataFrame into your target relational warehouse.

HTH


Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq  
wrote:
Hi Adaryl, 
 
You could definitely load data into a warehouse through Spark's JDBC support 
through DataFrames. Could you please explain your use case a bit more? That'll 
help us in answering your query better.
 
 
 


Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield 
 wrote:
I haven’t heard of Kafka connect. I’ll have to look into it. Kafka would, of 
course have to be in any architecture but it looks like they are suggesting 
that Kafka is all you need. 
 
My primary concern is the complexity of loading warehouses. I have a web 
development background so I have somewhat of an idea on how to insert data into 
a database from an application. I’ve since moved on to straight database 
programming and don’t work with anything that reads from an app anymore. 

 
Loading a warehouse requires a lot of cleaning of data and running and grabbing 
keys to maintain referential integrity. Usually that’s done in a batch process. 
Now I have to do it record by record (or a few records). I have some ideas but 
I’m not quite there yet.
 
I thought SparkSQL would be the way to get this done but so far, all the 
examples I’ve seen are just SELECT statements, no INSERTS or MERGE 
statements.
 
Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
 
From: Femi Anthony [mailto:femib...@gmail.com]
Sent: Tuesday, February 28, 2017 4:13 AM
To: Adaryl Wakefield 
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real 
time
 
Have you checked to see if there are any drivers to enable you to write to 
Greenplum directly from Spark ?
 
You can also take a look at this link:
 
https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
 
Apparently GPDB is based on Postgres so maybe that approach may work. 

Another approach maybe for Spark Streaming to write to Kafka, and then have 
another process read from Kafka and write to Greenplum.
 
Kafka Connect may be useful in this case -
 
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
 
Femi 

Re: (python) Spark .textFile(s3://…) access denied 403 with valid credentials

2017-03-07 Thread Amjad ALSHABANI
Hi Jonhy,

What is the master you are using with spark-submit?

I've had this problem before because Spark (unlike the CLI and boto3)
was running in YARN distributed mode (--master yarn), so the keys were not
copied to all the executors' nodes. I had to submit my Spark job as
follows:

$ spark-submit --master yarn-client --conf
"spark.executor.extraJavaOptions=-Daws.accessKeyId=ACCESSKEY
-Daws.secretKey=SECRETKEY"


I hope this will help


Amjad

On Tue, Mar 7, 2017 at 4:21 PM, Jonhy Stack  wrote:

> In order to access my S3 bucket i have exported my creds
>
> export AWS_SECRET_ACCESS_KEY=
> export AWS_ACCESSS_ACCESS_KEY=
>
> I can verify that everything works by doing
>
> aws s3 ls mybucket
>
> I can also verify with boto3 that it works in python
>
> resource = boto3.resource("s3", region_name="us-east-1")
> resource.Object("mybucket", "text/text.py") \
> .put(Body=open("text.py", "rb"),ContentType="text/x-py")
>
> This works and I can see the file in the bucket.
>
> However when I do this with spark:
>
> spark_context = SparkContext()
> sql_context = SQLContext(spark_context)
> spark_context.textFile("s3://mybucket/my/path/*)
>
> I get a nice
>
> > Caused by: org.jets3t.service.S3ServiceException: Service Error
> > Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> > Message:  > encoding="UTF-8"?>InvalidAccessKeyIdThe
> > AWS Access Key Id you provided does not exist in our
> > records.[MY_ACCESS_KEY] KeyId>Xxxx
>
> this is how I submit the job locally
>
> spark-submit --packages com.amazonaws:aws-java-sdk-pom
> :1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
>
> Why does it work with the command line + boto3, but Spark is choking?
>


(python) Spark .textFile(s3://…) access denied 403 with valid credentials

2017-03-07 Thread Jonhy Stack
In order to access my S3 bucket i have exported my creds

export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESSS_ACCESS_KEY=

I can verify that everything works by doing

aws s3 ls mybucket

I can also verify with boto3 that it works in python

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
.put(Body=open("text.py", "rb"),ContentType="text/x-py")

This works and I can see the file in the bucket.

However when I do this with spark:

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")

I get a nice

> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId>Xxxx

this is how I submit the job locally

spark-submit --packages com.amazonaws:aws-java-sdk-pom
:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py

Why does it work with the command line + boto3, but Spark is choking?


Re: How to unit test spark streaming?

2017-03-07 Thread Jörn Franke
This depends on your target setup! For my open source libraries, for example, I
run the Spark integration tests (a dedicated folder alongside the unit tests)
against a local Spark master, but I also use a MiniDFS cluster (to simulate HDFS
on a node) and sometimes a MiniYARN cluster (see
https://wiki.apache.org/hadoop/HowToDevelopUnitTests).

 An example can be found here:  
https://github.com/ZuInnoTe/hadoopcryptoledger/tree/master/examples/spark-bitcoinblock
 

or - if you need Scala - 
https://github.com/ZuInnoTe/hadoopcryptoledger/tree/master/examples/scala-spark-bitcoinblock
 

In both cases it is in the integration-tests (Java) or it (Scala) folder.

Spark Streaming - I have no open source example at hand, but basically you need 
to simulate the source and the rest is as above.

 I will eventually write a blog post about this with more details.

> On 7 Mar 2017, at 13:04, kant kodali  wrote:
> 
> Hi All,
> 
> How to unit test spark streaming or spark in general? How do I test the 
> results of my transformations? Also, more importantly don't we need to spawn 
> master and worker JVM's either in one or multiple nodes?
> 
> Thanks!
> kant

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark JDBC reads

2017-03-07 Thread Subhash Sriram
Could you create a view of the table on your JDBC data source and just query 
that from Spark?

Thanks,
Subhash 

Sent from my iPhone

> On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas  wrote:
> 
> As an example, this is basically what I'm doing:
> 
>  val myDF = originalDataFrame.select(col(columnName).when(col(columnName) 
> === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
> 
> Except there's much more columns and much more conditionals. The generated 
> Spark workflow starts with an SQL that basically does:
> 
>SELECT columnName, columnName2, etc. from table;
> 
> Then the conditionals/transformations are evaluated on the cluster.
> 
> Is there a way from the DataSet API to force the computation to happen on the 
> SQL data source in this case? Or should I work with JDBCRDD and use 
> createDataFrame on that?
> 
> 
>> On 03/07/2017 02:19 PM, Jörn Franke wrote:
>> Can you provide some source code? I am not sure I understood the problem .
>> If you want to do a preprocessing at the JDBC datasource then you can write 
>> your own data source. Additionally you may want to modify the sql statement 
>> to extract the data in the right format and push some preprocessing to the 
>> database.
>> 
>>> On 7 Mar 2017, at 12:04, El-Hassan Wanas  wrote:
>>> 
>>> Hello,
>>> 
>>> There is, as usual, a big table lying on some JDBC data source. I am doing 
>>> some data processing on that data from Spark, however, in order to speed up 
>>> my analysis, I use reduced encodings and minimize the general size of the 
>>> data before processing.
>>> 
>>> Spark has been doing a great job at generating the proper workflows that do 
>>> that preprocessing for me, but it seems to generate those workflows for 
>>> execution on the Spark Cluster. The issue with that is the large transfer 
>>> cost is still incurred.
>>> 
>>> Is there any way to force Spark to run the preprocessing on the JDBC data 
>>> source and get the prepared output DataFrame instead?
>>> 
>>> Thanks,
>>> 
>>> Wanas
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to unit test spark streaming?

2017-03-07 Thread Sam Elamin
Hey kant

You can use holdens spark test base

Have a look at some of the specs I wrote here to give you an idea

https://github.com/samelamin/spark-bigquery/blob/master/src/test/scala/com/samelamin/spark/bigquery/BigQuerySchemaSpecs.scala

Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df
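
A stripped-down illustration of that pattern (the names are made up, and it
assumes ScalaTest plus a local SparkSession in the test suite, e.g. from
spark-testing-base):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))       // the transformation under test

test("withGreeting adds a greeting column") {
  import spark.implicits._                      // spark is the test suite's local SparkSession
  val input = Seq("alice", "bob").toDF("name")
  val result = withGreeting(input)
  assert(result.columns.contains("greeting"))
  assert(result.count() === 2)
}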

Regards
Sam
On Tue, 7 Mar 2017 at 12:05, kant kodali  wrote:

> Hi All,
>
> How to unit test spark streaming or spark in general? How do I test the
> results of my transformations? Also, more importantly don't we need to
> spawn master and worker JVM's either in one or multiple nodes?
>
> Thanks!
> kant
>


How to unit test spark streaming?

2017-03-07 Thread kant kodali
Hi All,

How to unit test spark streaming or spark in general? How do I test the
results of my transformations? Also, more importantly don't we need to
spawn master and worker JVM's either in one or multiple nodes?

Thanks!
kant


Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas

As an example, this is basically what I'm doing:

  val myDF = 
originalDataFrame.select(col(columnName).when(col(columnName) === 
"foobar", 0).when(col(columnName) === "foobarbaz", 1))


Except there's much more columns and much more conditionals. The 
generated Spark workflow starts with an SQL that basically does:


SELECT columnName, columnName2, etc. from table;

Then the conditionals/transformations are evaluated on the cluster.

Is there a way from the DataSet API to force the computation to happen 
on the SQL data source in this case? Or should I work with JDBCRDD and 
use createDataFrame on that?



On 03/07/2017 02:19 PM, Jörn Franke wrote:

Can you provide some source code? I am not sure I understood the problem .
If you want to do a preprocessing at the JDBC datasource then you can write 
your own data source. Additionally you may want to modify the sql statement to 
extract the data in the right format and push some preprocessing to the 
database.


On 7 Mar 2017, at 12:04, El-Hassan Wanas  wrote:

Hello,

There is, as usual, a big table lying on some JDBC data source. I am doing some 
data processing on that data from Spark, however, in order to speed up my 
analysis, I use reduced encodings and minimize the general size of the data 
before processing.

Spark has been doing a great job at generating the proper workflows that do 
that preprocessing for me, but it seems to generate those workflows for 
execution on the Spark Cluster. The issue with that is the large transfer cost 
is still incurred.

Is there any way to force Spark to run the preprocessing on the JDBC data 
source and get the prepared output DataFrame instead?

Thanks,

Wanas


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark JDBC reads

2017-03-07 Thread Jörn Franke
Can you provide some source code? I am not sure I understood the problem .
If you want to do a preprocessing at the JDBC datasource then you can write 
your own data source. Additionally you may want to modify the sql statement to 
extract the data in the right format and push some preprocessing to the 
database.

> On 7 Mar 2017, at 12:04, El-Hassan Wanas  wrote:
> 
> Hello,
> 
> There is, as usual, a big table lying on some JDBC data source. I am doing 
> some data processing on that data from Spark, however, in order to speed up 
> my analysis, I use reduced encodings and minimize the general size of the 
> data before processing.
> 
> Spark has been doing a great job at generating the proper workflows that do 
> that preprocessing for me, but it seems to generate those workflows for 
> execution on the Spark Cluster. The issue with that is the large transfer 
> cost is still incurred.
> 
> Is there any way to force Spark to run the preprocessing on the JDBC data 
> source and get the prepared output DataFrame instead?
> 
> Thanks,
> 
> Wanas
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas

Hello,

There is, as usual, a big table lying on some JDBC data source. I am 
doing some data processing on that data from Spark, however, in order to 
speed up my analysis, I use reduced encodings and minimize the general 
size of the data before processing.


Spark has been doing a great job at generating the proper workflows that 
do that preprocessing for me, but it seems to generate those workflows 
for execution on the Spark Cluster. The issue with that is the large 
transfer cost is still incurred.


Is there any way to force Spark to run the preprocessing on the JDBC 
data source and get the prepared output DataFrame instead?


Thanks,

Wanas


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Check if dataframe is empty

2017-03-07 Thread Deepak Sharma
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath 
wrote:

> df.take(1).isEmpty should work


My bad.
It will return empty array:
 emptydf.take(1)
res0: Array[org.apache.spark.sql.Row] = Array()

and applying isEmpty would return boolean
 emptydf.take(1).isEmpty
res2: Boolean = true




-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than
throw an exception.

df.take(1).isEmpty should work
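
Applied to the Option[DataFrame] helper from the original mail, that would be
something like this sketch:

private def isEmpty(df: Option[DataFrame]): Boolean =
  df.forall(_.take(1).isEmpty)   // true for None, and true when take(1) returns an empty Array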

On Tue, 7 Mar 2017 at 07:42, Deepak Sharma  wrote:

> If the df is empty , the .take would return
> java.util.NoSuchElementException.
> This can be done as below:
> df.rdd.isEmpty
>
>
> On Tue, Mar 7, 2017 at 9:33 AM,  wrote:
>
> Dataframe.take(1) is faster.
>
>
>
> *From:* ashaita...@nz.imshealth.com [mailto:ashaita...@nz.imshealth.com]
> *Sent:* Tuesday, March 07, 2017 9:22 AM
> *To:* user@spark.apache.org
> *Subject:* Check if dataframe is empty
>
>
>
> Hello!
>
>
>
> I am pretty sure that I am asking something which has been already asked
> lots of times. However, I cannot find the question in the mailing list
> archive.
>
>
>
> The question is – I need to check whether dataframe is empty or not. I
> receive a dataframe from 3rd party library and this dataframe can be
> potentially empty, but also can be really huge – millions of rows. Thus, I
> want to avoid of doing some logic in case the dataframe is empty. How can I
> efficiently check it?
>
>
>
> Right now I am doing it in the following way:
>
>
>
> private def isEmpty(df: Option[DataFrame]): Boolean = {
>   df.isEmpty || (df.isDefined && df.get.limit(1).rdd.isEmpty())
> }
>
>
>
> But the performance is really slow for big dataframes. I would be grateful
> for any suggestions.
>
>
>
> Thank you in advance.
>
>
>
>
> Best regards,
>
>
>
> Artem
>
>
> --
>
> ** IMPORTANT--PLEASE READ 
> This electronic message, including its attachments, is CONFIDENTIAL and may
> contain PROPRIETARY or LEGALLY PRIVILEGED or PROTECTED information and is
> intended for the authorized recipient of the sender. If you are not the
> intended recipient, you are hereby notified that any use, disclosure,
> copying, or distribution of this message or any of the information included
> in it is unauthorized and strictly prohibited. If you have received this
> message in error, please immediately notify the sender by reply e-mail and
> permanently delete this message and its attachments, along with any copies
> thereof, from all locations received (e.g., computer, mobile device, etc.).
> Thank you.
> 
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
>
> __
>
> www.accenture.com
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-07 Thread Eli Super
Hi

It's a broad area of knowledge; you will need to spend several hours reading
about it online.

What is your programming language?

Try searching online for "machine learning binning %my_programming_language%"
and
"machine learning feature engineering %my_programming_language%"

On Tue, Mar 7, 2017 at 3:39 AM, Raju Bairishetti  wrote:

> @Eli, Thanks for the suggestion. If you do not mind can you please
> elaborate approaches?
>
> On Mon, Mar 6, 2017 at 7:29 PM, Eli Super  wrote:
>
>> Hi
>>
>> Try to implement binning and/or feature engineering (smart feature
>> selection for example)
>>
>> Good luck
>>
>> On Mon, Mar 6, 2017 at 6:56 AM, Raju Bairishetti  wrote:
>>
>>> Hi,
>>>   I am new to Spark ML Lib. I am using FPGrowth model for finding
>>> related items.
>>>
>>> Number of transactions are 63K and the total number of items in all
>>> transactions are 200K.
>>>
>>> I am running the FPGrowth model to generate frequent itemsets. It is
>>> taking a huge amount of time to generate frequent itemsets. I am setting
>>> the min-support value such that each item appears in at least ~(number of
>>> items)/(number of transactions).
>>>
>>> It takes a lot of time if I say an item can appear at least once
>>> in the database.
>>>
>>> If I give a higher value to min-support then the output is much smaller.
>>>
>>> Could anyone please guide me how to reduce the execution time for
>>> generating frequent items?
>>>
>>> --
>>> Thanks,
>>> Raju Bairishetti,
>>> www.lazada.com
>>>
>>
>>
>
>
> --
>
> --
> Thanks,
> Raju Bairishetti,
> www.lazada.com
>