Re: Market Basket Analysis by deploying FP Growth algorithm

2017-04-05 Thread Patrick Plaatje
Hi Arun,

We have been running into the same issue (with only 1,000 unique items across 
100MM transactions), but have not investigated the root cause. We decided to 
run this on a cluster instead (4*16 / 64GB RAM), after which the OOM issue 
went away. However, we then found that the FPGrowth implementation starts 
spilling over to disk, and we had to increase the /tmp partition.
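
For reference, a minimal FPGrowth invocation on Spark 1.6 MLlib looks roughly 
like the sketch below; the input path and the minSupport/numPartitions values 
are illustrative assumptions, not numbers from this thread. Raising 
setNumPartitions spreads the conditional FP-trees over more tasks, which is 
the first knob we'd try for memory pressure (at the cost of the disk spill 
mentioned above).

import org.apache.spark.mllib.fpm.FPGrowth

// one array of de-duplicated item ids per transaction; path is illustrative
val transactions = sc.textFile("hdfs:///data/invoices.txt")
  .map(_.split(" ").distinct)

val model = new FPGrowth()
  .setMinSupport(0.01)    // lowering this makes the FP-tree (and memory use) grow quickly
  .setNumPartitions(200)  // more partitions spread the conditional trees over more tasks
  .run(transactions)

model.freqItemsets.take(10).foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}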

Hope it helps.

BR,
-patrick



On 05/04/2017, 10:29, "asethia"  wrote:

Hi,

We are currently working on a Market Basket Analysis, deploying the FP-Growth
algorithm on Spark to generate association rules for product recommendation.
We are running on close to 24 million invoices over an assortment of more
than 100k products. However, whenever we relax the support threshold below a
certain level, the stack overflows. We are using Spark 1.6.2 but can also run
1.6.3 to counter this error. The problem, though, is that even when we run
Spark 1.6.3 and increase the stack size to 100M, we run out of memory. We
believe the FP-tree grows exponentially and is held entirely in memory, which
causes this problem. Can anyone suggest a solution to this issue, please?

Thanks
Arun



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Market-Basket-Analysis-by-deploying-FP-Growth-algorithm-tp28569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org








Re: FP growth - Items in a transaction must be unique

2017-02-02 Thread Patrick Plaatje
Hi,

 

This indicates that you have duplicate products per row in your dataframe. The 
FPGrowth implementation only allows unique products per row, so you will need to 
dedupe the products in each row before running the algorithm.
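
A minimal sketch of that dedupe step, reusing the names dataframe and the 
comma-separated productName column from the question below (the support and 
partition values are just the ones from that snippet):

import org.apache.spark.mllib.fpm.FPGrowth

// split each basket on "," and drop duplicate entries before handing it to FPGrowth
val transactions = names.as[String].rdd
  .map(_.split(",").map(_.trim).distinct)

val model = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(100)
  .run(transactions)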

 

Best,

Patrick

 

From: "Devi P.V" 
Date: Thursday, 2 February 2017 at 07:17
To: "user @spark" 
Subject: FP growth - Items in a transaction must be unique

 

Hi all,

I am trying to run the FP-Growth algorithm using Spark and Scala. A sample 
input dataframe is as follows:

+-------------------------------------------------------------------------------------------+
|productName                                                                                 |
+-------------------------------------------------------------------------------------------+
|Apple Iphone 7 128GB Jet Black with Facetime                                                |
|Levi’s Blue Slim Fit Jeans- L5112,Rimmel London Lasting Finish Matte by Kate Moss 101 Dusky |
|Iphone 6 Plus (5.5",Limited Stocks, TRA Oman Approved)                                      |
+-------------------------------------------------------------------------------------------+

Each row contains unique items.

 

I converted it into an RDD as follows:

import org.apache.spark.mllib.fpm.FPGrowth

val transactions = names.as[String].rdd.map(s => s.split(","))

val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(100)

val model = fpg.run(transactions)

But I got the following error:

WARN TaskSetManager: Lost task 2.0 in stage 27.0 (TID 622, localhost):
org.apache.spark.SparkException: Items in a transaction must be unique but got
WrappedArray(Huawei GR3 Dual Sim 16GB 13MP 5Inch 4G,  Huawei G8 Gold 32GB,  4G,
 5.5 Inches, HTC Desire 816 (Dual Sim, 3G, 8GB),  Samsung Galaxy S7 Single Sim - 32GB,
 4G LTE,  Gold, Huawei P8 Lite 16GB,  4G LTE, Huawei Y625, Samsung Galaxy Note 5 - 32GB,
 4G LTE, Samsung Galaxy S7 Dual Sim - 32GB)

How can I solve this?

Thanks



 

 



Re: newbie unable to write to S3 403 forbidden error

2016-02-13 Thread Patrick Plaatje
Not sure if it’s related, but in our Hadoop configuration we’re also setting 

sc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
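
For completeness, a minimal Scala sketch of that setting together with the s3n 
credentials Andy sets further down; the key values, bucket name and someRdd are 
placeholders, not taken from this thread:

// back s3:// URIs with the native S3 filesystem
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
// credentials used by s3n:// URIs (placeholder values)
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

// then write with the bucket name as the authority of the URI
someRdd.saveAsTextFile("s3n://your-bucket/some/prefix")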

Cheers,
-patrick

From:  Andy Davidson 
Date:  Friday, 12 February 2016 at 17:34
To:  Igor Berman 
Cc:  "user @spark" 
Subject:  Re: newbie unable to write to S3 403 forbidden error

Hi Igor

So I assume you are able to use S3 from Spark?

Do you use rdd.saveAsTextFile()?

How did you create your cluster? I.e., did you use the spark-1.6.0/spark-ec2 
script, EMR, or something else?


I tried several versions of the URL, but no luck :-(

The bucket name is ‘com.ps.twitter’. It has a folder ‘son'

We have a developer support contract with Amazon; however, our case has been 
unassigned for several days now.

Thanks

Andy

P.S. In general, debugging permission problems from the client side is always 
difficult; secure servers do not want to make it easy for hackers.

From:  Igor Berman 
Date:  Friday, February 12, 2016 at 4:53 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: newbie unable to write to S3 403 forbidden error

 String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/json"

Not sure, but can you try to remove s3-us-west-1.amazonaws.com from the path?
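
In other words, something along these lines; the bucket and folder names are 
taken from the snippet below, and the val names are just placeholders to show 
the URI shape:

// with the regional endpoint in the path, as in the code below
val withEndpoint = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/json"

// without it - the bucket name alone is the authority of the s3n:// URI
val bucketOnly = "s3n://com.pws.twitter/json"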

On 11 February 2016 at 23:15, Andy Davidson  
wrote:
I am using Spark 1.6.0 in a cluster created using the spark-ec2 script. I am 
using the standalone cluster manager.

My Java streaming app is not able to write to S3. It appears to be some form of 
permission problem.

Any idea what the problem might be?

I tried using the IAM simulator to test the policy. Everything seems okay. Any 
idea how I can debug this problem?

Thanks in advance

Andy

JavaSparkContext jsc = new JavaSparkContext(conf);

// I did not include the full key in my email
// the keys do not contain '\'
// these are the keys used to create the cluster. They belong to the IAM user andy
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAJREX");
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "uBh9v1hdUctI23uvq9qR");


private static void saveTweets(JavaDStream<String> jsonTweets, String outputURI) {

    jsonTweets.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(JavaRDD<String> rdd, Time time) throws Exception {
            if (!rdd.isEmpty()) {
                // bucket name is 'com.pws.twitter', it has a folder 'json'
                String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/json"
                        + "-" + time.milliseconds();
                rdd.saveAsTextFile(dirPath);
            }
        }
    });
}




Bucket name : com.pws.titter
Bucket policy (I replaced the account id)

{
  "Version": "2012-10-17",
  "Id": "Policy1455148808376",
  "Statement": [
    {
      "Sid": "Stmt1455148797805",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/andy"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::com.pws.twitter/*"
    }
  ]
}






Getting top distinct strings from arraylist

2016-01-25 Thread Patrick Plaatje
Hi, 

I’m quite new to Spark and MR, but have a requirement to get all distinct 
values with their respective counts from a transactional file. Let’s assume the 
following file format:

0 1 2 3 4 5 6 7
1 3 4 5 8 9
9 10 11 12 13 14 15 16 17 18
1 4 7 11 12 13 19 20
3 4 7 11 15 20 21 22 23
1 2 5 9 11 12 16

Given this, I would like to get an ArrayList of (String, Integer) pairs back, 
where the String is the item identifier and the Integer is the count of that 
item identifier in the file. The following is what I came up with to map the 
values, but I can't figure out how to do the counting :(

// create an RDD with an ArrayList of strings per input line

JavaRDD<ArrayList<String>> transactions = sc.textFile(dataPath).map(

    new Function<String, ArrayList<String>>() {

        private static final long serialVersionUID = 1L;

        @Override
        public ArrayList<String> call(String s) {
            return Lists.newArrayList(s.split(" "));
        }
    }
);


Any ideas?

Thanks!
Patrick
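
For what it's worth, the counting step can be done by flattening the tokens and 
reducing by key. A minimal sketch follows (in Scala for brevity, assuming a 
SparkContext sc and the same dataPath as in the Java snippet above; the 
variable names are illustrative):

// count how often each item identifier occurs in the whole file
val counts = sc.textFile(dataPath)
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .map(item => (item, 1))
  .reduceByKey(_ + _)

// e.g. the 10 most frequent distinct items, as (identifier, count) pairs
counts.sortBy(_._2, ascending = false).take(10).foreach(println)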