Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Edward Capriolo
Here is a similar, though not identical, approach to what you described. I had
two data files in different formats, and their different columns needed to
become different features. I wanted to feed them into Spark's FP-Growth:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm

This only works because I have a few named features, which become fields in
the model object AntecedentUnion. It would be a poor solution for a large
sparse matrix.

My Scala code is also rough, so there is probably a better way to do this!


import org.apache.spark.rdd.RDD

// Group the targeting dataset by user id (tdid) and convert each row into the
// generic AntecedentUnion, tagged as "targeting"; unused fields are left empty.
val b = targ.as[TargetingAntecedent]
val b1 = b.map(c => (c.tdid, c)).rdd.groupByKey()
val bgen = b1.map(f =>
  (f._1, f._2.map(x =>
    AntecedentUnion("targeting", "", x.targetingdataid, "", ""))))

// Same for the impression dataset, tagged as "impression"; toSet drops
// duplicate (campaignid, adgroupid) pairs.
val c = imp.as[ImpressionAntecedent]
val c1 = c.map(k => (k.tdid, k)).rdd.groupByKey()
val cgen = c1.map(f =>
  (f._1, f._2.map(x =>
    AntecedentUnion("impression", "", "", x.campaignid, x.adgroupid))
    .toSet.toIterable))

// The same two transforms are wrapped in utility methods, so the above is
// equivalent to:
//   val bgen = TargetingUtil.targetingAntecedent(sparkSession, sqlContext, targ)
//   val cgen = TargetingUtil.impressionAntecedent(sparkSession, sqlContext, imp)

// Join on tdid and concatenate the two feature collections so each user ends
// up with a single transaction of AntecedentUnion records.
val joined = bgen.join(cgen)
val merged = joined.map(f => (f._1, f._2._1 ++: f._2._2))
val fullResults: RDD[Array[AntecedentUnion]] =
  merged.map(_._2).map(_.toArray[audacity.AntecedentUnion])


So essentially everything is converted into AntecedentUnion, where the first
column is the type of the tuple and the remaining fields are either supplied
or left empty. Then all of those are merged per user and FPGrowth is run on
them. Hope that helps!
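
For reference, here is a minimal sketch of what running FPGrowth over those
merged transactions can look like (it assumes the fullResults RDD from above;
the minSupport and numPartitions values are placeholders, not anything tuned):

import org.apache.spark.mllib.fpm.FPGrowth

// Each element of fullResults is one user's transaction of AntecedentUnion items.
val fpg = new FPGrowth()
  .setMinSupport(0.01)   // placeholder support threshold
  .setNumPartitions(10)  // placeholder parallelism
val model = fpg.run(fullResults)

// Inspect the frequent itemsets that were mined.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ", ", "]") + " -> " + itemset.freq)
}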



On Mon, May 15, 2017 at 12:06 PM, goun na  wrote:
>
> I had it backwards: collect_list is the one that generates duplicated
> results (collect_set removes them).
>
> 2017-05-16 0:50 GMT+09:00 goun na :
>>
>> Hi, Jone Zhang
>>
>> 1. Hive UDF
>> You might need collect_set or collect_list (to eliminate duplication), but
>> make sure to reduce the cardinality before applying these UDFs, as high
>> cardinality can cause problems when handling 1 billion records. Union
>> datasets 1, 2, and 3 -> group by user_id1 -> collect_set(feature column)
>> would work.
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
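>>
>> A rough sketch of that flow as a single query (the table and column names
>> are made up for illustration, and sparkSession is whatever session or Hive
>> client you run it from):
>>
>> val result = sparkSession.sql("""
>>   SELECT user_id, collect_set(feature) AS features
>>   FROM (
>>     SELECT user_id, feature FROM data1
>>     UNION ALL
>>     SELECT user_id, feature FROM data2
>>     UNION ALL
>>     SELECT user_id, feature FROM data3
>>   ) unioned
>>   GROUP BY user_id
>> """)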
>>
>> 2. Spark DataFrame Pivot
>>
>> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
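>>
>> A quick sketch of that pivot approach (the DataFrame and column names are
>> hypothetical; it assumes the three datasets were first unioned into a long
>> user_id / feature_name / feature_value layout):
>>
>> import org.apache.spark.sql.functions.first
>>
>> val wide = unionedDf
>>   .groupBy("user_id")
>>   .pivot("feature_name")
>>   .agg(first("feature_value"))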
>>
>> - Goun
>>
>> 2017-05-15 22:15 GMT+09:00 Jone Zhang :
>>>
>>> For example
>>> Data1(has 1 billion records)
>>> user_id1  feature1
>>> user_id1  feature2
>>>
>>> Data2(has 1 billion records)
>>> user_id1  feature3
>>>
>>> Data3(has 1 billion records)
>>> user_id1  feature4
>>> user_id1  feature5
>>> ...
>>> user_id1  feature100
>>>
>>> I want to get the result as follow
>>> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>>>
>>> Is there a more efficient way than a join?
>>>
>>> Thanks!
>>
>>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Edward Capriolo
You cannot 'add jar' input formats and SerDes. They need to be part of your
auxlib.
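
A hedged sketch of what "part of your auxlib" usually means in practice (the
paths and jar names are examples only, and CDH layouts differ): either drop
the jar into HiveServer2's auxlib directory, e.g. $HIVE_HOME/auxlib/your-serde.jar,
or point hive.aux.jars.path at it in hive-site.xml, then restart HiveServer2:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///path/to/your-serde.jar</value>
</property>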

On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion  wrote:

> I tried now. still getting
>
> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
> Serialization trace:
> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>
>
> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>
>
> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure  wrote:
>
>> Did you try the --jars property in spark-submit? If your jar is huge, you
>> can pre-load the jar on all executors in a commonly available directory to
>> avoid network IO.
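>>
>> For reference, a minimal example of that flag (the jar paths, main class,
>> and application jar are placeholders):
>>
>>   spark-submit --jars /path/to/dep1.jar,/path/to/dep2.jar --class com.example.Main app.jar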
>>
>> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion 
>> wrote:
>>
>>> I'm trying to add jars before running a query using Hive on Spark on CDH
>>> 5.4.3.
>>> I've tried applying the patch in
>>> https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch
>>> was written against a different Hive version) but still haven't succeeded.
>>>
>>> Did anyone manage to do ADD JAR successfully with CDH?
>>>
>>> Thanks,
>>> Ophir
>>>
>>
>>
>