Re: Programatically create RDDs based on input

2015-10-31 Thread Natu Lauchande
Hi Amit,

I don't see any default constructor in the JavaRDD docs
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html
.

Have you tried the following ?

List<JavaRDD<String>> jRDD = new ArrayList<>();

jRDD.add(jsc.textFile("/file1.txt"));
jRDD.add(jsc.textFile("/file2.txt"));
...

Natu
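
(A rough Java equivalent of the Python loop further down this thread — a sketch only, assuming an existing JavaSparkContext named jsc; a Map keyed by file name works just as well if you need to retrieve a specific RDD later:)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// A collection avoids the "cannot create a generic array of JavaRDD" problem
List<String> fileList = Arrays.asList("/file1.txt", "/file2.txt");
List<JavaRDD<String>> rdds = new ArrayList<JavaRDD<String>>();
for (String f : fileList) {
    rdds.add(jsc.textFile(f));
}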


On Sat, Oct 31, 2015 at 11:18 PM, ayan guha  wrote:

> My java knowledge is limited, but you may try with a hashmap and put RDDs
> in it?
>
> On Sun, Nov 1, 2015 at 4:34 AM, amit tewari 
> wrote:
>
>> Thanks Ayan thats something similar to what I am looking at but trying
>> the same in Java is giving compile error:
>>
>> JavaRDD jRDD[] = new JavaRDD[3];
>>
>> //Error: Cannot create a generic array of JavaRDD
>>
>> Thanks
>> Amit
>>
>>
>>
>> On Sat, Oct 31, 2015 at 5:46 PM, ayan guha  wrote:
>>
>>> Corrected a typo...
>>>
>>> # In Driver
>>> fileList=["/file1.txt","/file2.txt"]
>>> rdds = []
>>> for f in fileList:
>>>  rdd = jsc.textFile(f)
>>>  rdds.append(rdd)
>>>
>>>
>>> On Sat, Oct 31, 2015 at 11:14 PM, ayan guha  wrote:
>>>
 Yes, this can be done. quick python equivalent:

 # In Driver
 fileList=["/file1.txt","/file2.txt"]
 rdd = []
 for f in fileList:
  rdd = jsc.textFile(f)
  rdds.append(rdd)



 On Sat, Oct 31, 2015 at 11:09 PM, amit tewari 
 wrote:

> Hi
>
> I need the ability to be able to create RDDs programatically inside my
> program (e.g. based on varaible number of input files).
>
> Can this be done?
>
> I need this as I want to run the following statement inside an
> iteration:
>
> JavaRDD rdd1 = jsc.textFile("/file1.txt");
>
> Thanks
> Amit
>



 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Assign unique link ID

2015-10-31 Thread Sarath Chandra
Hi All,

I have a hive table where data from 2 different sources (S1 and S2) get
accumulated. Sample data below -

RECORD_ID|SOURCE_TYPE|TRN_NO|DATE1|DATE2|BRANCH|REF1|REF2|REF3|REF4|REF5|REF6|DC_FLAG|AMOUNT|CURRENCY
1|S1|55|19-Oct-2015|19-Oct-2015|25602|999||41106|47311|379|9|004|999||Cr|2672.00|INR
2|S1|55|19-Oct-2015|19-Oct-2015|81201|999||41106|9|379|9|004|999||Dr|2672.00|INR
3|S2|55|19-OCT-2015|19-OCT-2015|81201|999||41106|9|379|9|004|999||DR|2672|INR
4|S2|55|19-OCT-2015|19-OCT-2015|25602|999||41106|47311|379|9|004|999||CR|2672|INR

I have a requirement to link similar records (same dates, branch and
reference numbers) source-wise and assign them a unique ID linking the 2
records. For example, records 1 and 4 above should be linked with the same ID.

I've written the code below to segregate the data source-wise and join the two
sets based on the similarities, but I don't know how to proceed further.

var hc = new org.apache.spark.sql.hive.HiveContext(sc);
var src = hc.sql("select RECORD_ID,SOURCE,TRN_NO,DATE1,DATE2,BRANCH,REF1,REF2,REF3,REF4,REF5,REF6,DC_FLAG,AMOUNT,CURRENCY from src_table");

var s1 = src.filter("source_type='S1'");
var s2 = src.filter("source_type='S2'");

var src_join = s1.as("S1").join(s2.as("S2")).filter("(S1.TRN_NO = S2.TRN_NO) and (S1.DATE1 = S2.DATE1) and (S1.DATE2 = S2.DATE2) and (S1.BRANCH = S2.BRANCH) and (S1.REF1 = S2.REF1) and (S1.REF2 = S2.REF2) and (S1.REF3 = S2.REF3) and (S1.REF4 = S2.REF4) and (S1.REF5 = S2.REF5) and (S1.REF6 = S2.REF6) and (S1.CURRENCY = S2.CURRENCY)");

I tried using a UDF that returns a random value or a hashed string built from
the record IDs of both sides and adding it to the schema using withColumn, but
I ended up getting duplicate link IDs.

Also, when I use a UDF I'm not able to refer to the columns using the alias in
the next steps. For example, if I create a new DF using the line below -
var src_link = src_join.as("SJ").withColumn("LINK_ID", linkIDUDF(src_join("S1.RECORD_ID"), src("S2.RECORD_ID")));
then in further lines I'm not able to refer to the "S1" columns from "src_link",
like -
var src_link_s1 = src_link.as("SL").select($"S1.RECORD_ID");

Please guide me.

Regards,
Sarath.


Re: Assign unique link ID

2015-10-31 Thread ayan guha
Can this be a solution?

1. Write a function which will take a string and convert to md5 hash
2. From your base table, generate a string out of all columns you have used
for joining. So, records 1 and 4 should generate same hash value.
3. group by using this new id (you have already linked the records) and
pull out required fields.
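
(A rough sketch of steps 1 and 2 on the `src` DataFrame from the original post, assuming Spark 1.5+ where the built-in md5, upper and concat_ws functions exist; the key column list is illustrative:)

import org.apache.spark.sql.functions.{concat_ws, md5, upper}

// Columns that identify matching records across sources S1 and S2
val keyCols = Seq("TRN_NO", "DATE1", "DATE2", "BRANCH", "REF1", "REF2",
                  "REF3", "REF4", "REF5", "REF6", "CURRENCY")

// upper() irons out the case differences between the sources (19-Oct vs 19-OCT);
// the md5 of the concatenated key columns gives records 1 and 4 the same LINK_ID
val linked = src.withColumn("LINK_ID",
  md5(upper(concat_ws("|", keyCols.map(src(_)): _*))))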

Please let the group know if it works...

Best
Ayan

On Sat, Oct 31, 2015 at 6:44 PM, Sarath Chandra <
sarathchandra.jos...@algofusiontech.com> wrote:

> Hi All,
>
> I have a hive table where data from 2 different sources (S1 and S2) get
> accumulated. Sample data below -
>
>
> *RECORD_ID|SOURCE_TYPE|TRN_NO|DATE1|DATE2|BRANCH|REF1|REF2|REF3|REF4|REF5|REF6|DC_FLAG|AMOUNT|CURRENCY*
>
> *1|S1|55|19-Oct-2015|19-Oct-2015|25602|999||41106|47311|379|9|004|999||Cr|2672.00|INR*
>
> *2|S1|55|19-Oct-2015|19-Oct-2015|81201|999||41106|9|379|9|004|999||Dr|2672.00|INR*
>
> *3|S2|55|19-OCT-2015|19-OCT-2015|81201|999||41106|9|379|9|004|999||DR|2672|INR*
>
> *4|S2|55|19-OCT-2015|19-OCT-2015|25602|999||41106|47311|379|9|004|999||CR|2672|INR*
>
> I have a requirement to link similar records (same dates, branch and
> reference numbers) source wise and assign them unique ID linking the 2
> records. For example records 1 and 4 above should be linked with same ID.
>
> I've written code below to segregate data source wise and join them based
> on the similarities. But not knowing how to proceed further.
>
> *var hc = new org.apache.spark.sql.hive.HiveContext(sc);*
> *var src = hc.sql("select
> RECORD_ID,SOURCE,TRN_NO,DATE1,DATE2,BRANCH,REF1,REF2,REF3,REF4,REF5,REF6,DC_FLAG,AMOUNT,CURRENCY
> from src_table");*
>
> *var s1 = src.filter("source_type='S1'");*
>
> *var s2 = src.filter("source_type='S2'");*
> *var src_join = s1.as ("S1").join(s2.as
> ("S2")).filter("(S1.TRN_NO= S2.TRN_NO) and (S1.DATE1=
> S2.DATE1) and (S1.DATE2= S2.DATE2) and (S1.BRANCH= S2.BRANCH) and (S1.REF1=
> S2.REF1) and (S1.REF2= S2.REF2) and (S1.REF3= S2.REF3) and (S1.REF4=
> S2.REF4) and (S1.REF5= S2.REF5) and (S1.REF6= S2.REF6) and (S1.CURRENCY=
> S2.CURRENCY)");*
>
> Tried using a UDF which returns a random value or hashed string using
> record IDs of both sides and include it to schema using withColumn, but
> ended up getting duplicate link IDs.
>
> Also when I use a UDF I'm not able to refer to the columns using the alias
> in next steps. For example if I create a new DF using below line -
> *var src_link = src_join.as
> ("SJ").withColumn("LINK_ID",
> linkIDUDF(src_join("S1.RECORD_ID"),src("S2.RECORD_ID")));*
> Then in further lines I'm not able to refer to "s1" columns from
> "src_link" like -
> *var src_link_s1 = src_link.as
> ("SL").select($"S1.RECORD_ID");*
>
> Please guide me.
>
> Regards,
> Sarath.
>



-- 
Best Regards,
Ayan Guha


Re: Spark tunning increase number of active tasks

2015-10-31 Thread Jörn Franke
Maybe Hortonworks support can help you much better.

Otherwise you may want to change the yarn scheduler configuration and 
preemption. Do you use something like speculative execution?

How do you start execution of the programs? Maybe you are already using all 
cores of the master...

> On 30 Oct 2015, at 23:32, YI, XIAOCHUAN  wrote:
> 
> Hi
> Our team has a 40 node hortonworks Hadoop cluster 2.2.4.2-2  (36 data node) 
> with apache spark 1.2 and 1.4 installed.
> Each node has 64G RAM and 8 cores.
>  
> We are only able to use <= 72 executors with executor-cores=2
> So we are only get 144 active tasks running pyspark programs with pyspark.
> [Stage 1:===>(596 + 144) / 
> 2042]
> IF we use larger number for --num-executors, the pyspark program exit with 
> errors:
> ERROR YarnScheduler: Lost executor 113 on hag017.example.com: remote Rpc 
> client disassociated
>  
> I tried spark 1.4 and conf.set("dynamicAllocation.enabled", "true"). However 
> it does not help us to increase the number of active tasks.
> I expect larger number of active tasks with the cluster we have.
> Could anyone advise on this? Thank you very much!
>  
> Shaun
>  


Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-31 Thread Rex Xiong
Adding this thread back to the email list; I forgot to reply all.
On 31 Oct 2015 at 7:23 PM, "Michael Armbrust"  wrote:

> Not that I know of.
>
> On Sat, Oct 31, 2015 at 12:22 PM, Rex Xiong  wrote:
>
>> Good to know that, will have a try.
>> So there is no easy way to achieve it in pure hive method?
>> On 31 Oct 2015 at 7:17 PM, "Michael Armbrust"  wrote:
>>
>>> Yeah, this was rewritten to be faster in Spark 1.5.  We use it with
>>> 10,000s of partitions.
>>>
>>> On Sat, Oct 31, 2015 at 7:17 AM, Rex Xiong  wrote:
>>>
 1.3.1
 It is a lot of improvement in 1.5+?

 2015-10-30 19:23 GMT+08:00 Michael Armbrust :

> We have tried schema merging feature, but it's too slow, there're
>> hundreds of partitions.
>>
> Which version of Spark?
>


>>>
>
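
(For reference, in Spark 1.5+ schema merging is off by default and can be requested per read; a minimal sketch, assuming a SQLContext named sqlContext and a placeholder path:)

// Merges the schemas of all Parquet part-files/partitions at read time
val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/partitioned_table")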


Programatically create RDDs based on input

2015-10-31 Thread amit tewari
Hi

I need the ability to create RDDs programmatically inside my
program (e.g. based on a variable number of input files).

Can this be done?

I need this as I want to run the following statement inside an iteration:

JavaRDD rdd1 = jsc.textFile("/file1.txt");

Thanks
Amit


job hangs when using pipe() with reduceByKey()

2015-10-31 Thread hotdog
I've run into a situation.
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count()  // just used to persist a
val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()
it's so fast.

But when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(_ + _).count()
it is so slow, and there are many log lines like these in my executors:
15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling
in-memory map of 633.1 MB to disk (9 times so far)
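
(Not a diagnosis, but the spilling suggests the per-task shuffle maps are too large; reduceByKey takes an explicit partition count, which is one common way to spread that memory out — a sketch, with a placeholder count:)

// More reduce-side partitions -> smaller in-memory maps per task (4096 is a placeholder)
val b = rdd.pipe("./my_cpp_program")
  .map(s => (s, 1))
  .reduceByKey(_ + _, 4096)
  .count()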



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/job-hangs-when-using-pipe-with-reduceByKey-tp25242.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pivot Data in Spark and Scala

2015-10-31 Thread ayan guha
(disclaimer: my reply in SO)

http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes/30278605#30278605


On Sat, Oct 31, 2015 at 6:21 AM, Ali Tajeldin EDU 
wrote:

> You can take a look at the smvPivot function in the SMV library (
> https://github.com/TresAmigosSD/SMV ).  Should look for method "smvPivot"
> in SmvDFHelper (
>
> http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper).
> You can also perform the pivot on a group-by-group basis.  See smvPivot and
> smvPivotSum in SmvGroupedDataFunc (
> http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc
> ).
>
> Docs from smvPivotSum are copied below.  Note that you don't have to
> specify the baseOutput columns, but if you don't, it will force an
> additional action on the input data frame to build the cross products of
> all possible values in your input pivot columns.
>
> Perform a normal SmvPivot operation followed by a sum on all the output
> pivot columns.
> For example:
>
> df.smvGroupBy("id").smvPivotSum(Seq("month", "product"))("count")("5_14_A", 
> "5_14_B", "6_14_A", "6_14_B")
>
> and the following input:
>
> Input
> | id  | month | product | count |
> | --- | - | --- | - |
> | 1   | 5/14  |   A |   100 |
> | 1   | 6/14  |   B |   200 |
> | 1   | 5/14  |   B |   300 |
>
> will produce the following output:
>
> | id  | count_5_14_A | count_5_14_B | count_6_14_A | count_6_14_B |
> | --- |  |  |  |  |
> | 1   | 100  | 300  | NULL | 200  |
>
> pivotCols
> The sequence of column names whose values will be used as the output pivot
> column names.
> valueCols
> The columns whose value will be copied to the pivoted output columns.
> baseOutput
> The expected base output column names (without the value column prefix).
> The user is required to supply the list of expected pivot column output
> names to avoid and extra action on the input DataFrame just to extract the
> possible pivot columns. if an empty sequence is provided, then the base
> output columns will be extracted from values in the pivot columns (will
> cause an action on the entire DataFrame!)
>
> --
> Ali
> PS: shoot me an email if you run into any issues using SMV.
>
>
> On Oct 30, 2015, at 6:33 AM, Andrianasolo Fanilo <
> fanilo.andrianas...@worldline.com> wrote:
>
> Hey,
>
> The question is tricky, here is a possible answer by defining years as
> keys for a hashmap per client and merging those :
>
>
> import scalaz._
> import Scalaz._
>
> val sc = new SparkContext("local[*]", "sandbox")
>
> // Create RDD of your objects
> val rdd = sc.parallelize(Seq(
>   ("A", 2015, 4),
>   ("A", 2014, 12),
>   ("A", 2013, 1),
>   ("B", 2015, 24),
>   ("B", 2013, 4)
> ))
>
> // Search for all the years in the RDD
> val minYear = rdd.map(_._2).reduce(Math.min)   // look for minimum year
> val maxYear = rdd.map(_._2).reduce(Math.max)   // look for maximum year
> val sequenceOfYears = maxYear to minYear by -1 // create sequence of years from max to min
>
> // Define functions to build, for each client, a Map of year -> value for that year,
> // and how those maps will be merged
> def createCombiner(obj: (Int, Int)): Map[Int, String] = Map(obj._1 -> obj._2.toString)
> def mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum + (obj._1 -> obj._2.toString)
> def mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) =
>   accum1 |+| accum2 // I'm lazy so I use Scalaz to merge two maps of year -> value,
>                     // I assume we don't have two lines with the same client and year...
>
> // For each client, check for each year from maxYear to minYear if it exists in the
> // computed map. If not, input a blank.
> val result = rdd
>   .map { case obj => (obj._1, (obj._2, obj._3)) }
>   .combineByKey(createCombiner, mergeValue, mergeCombiners)
>   .map { case (name, mapOfYearsToValues) =>
>     (Seq(name) ++ sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, " "))).mkString(",")
>   } // here we assume that the sequence of all years isn't too big to fit in memory.
>     // If you had to compute for each day, it may break and you would definitely need
>     // to use a specialized timeseries library...
>
> result.foreach(println)
>
> sc.stop()
>
> Best regards,
> Fanilo
>
> From: Adrian Tanase [mailto:atan...@adobe.com]
> Sent: Friday, 30 October 2015 11:50
> To: Deng Ching-Mallete; Ascot Moss
> Cc: User
> Subject: Re: Pivot Data in Spark and Scala
>
> It's actually a bit tougher, as you'll first need all the years. Also not
> sure how you would represent your “columns” given they are dynamic based on
> the input data.
>
> Depending on your downstream processing, I’d probably try to emulate it
> with a hash map with years as keys instead of the columns.
>
> There is probably a nicer solution using the data 

Re: Pulling data from a secured SQL database

2015-10-31 Thread Michael Armbrust
I would try using the JDBC Data Source and save the data to parquet.
You can then put that data on your Spark cluster (probably install HDFS).
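
(A minimal sketch of that approach, run from the machine that can reach SQL Server — the JDBC URL, table and output path are placeholders, and it assumes Spark 1.4+ with the Microsoft JDBC driver on the classpath:)

// Pull through the JDBC data source, then persist as Parquet for the cluster to read
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost;databaseName=mydb;integratedSecurity=true")
  .option("dbtable", "dbo.source_table")
  .load()

df.write.parquet("hdfs://namenode:8020/data/source_table.parquet")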

On Fri, Oct 30, 2015 at 6:49 PM, Thomas Ginter 
wrote:

> I am working in an environment where data is stored in MS SQL Server.  It
> has been secured so that only a specific set of machines can access the
> database through an integrated security Microsoft JDBC connection.  We also
> have a couple of beefy linux machines we can use to host a Spark cluster
> but those machines do not have access to the databases directly.  How can I
> pull the data from the SQL database on the smaller development machine and
> then have it distribute to the Spark cluster for processing?  Can the
> driver pull data and then distribute execution?
>
> Thanks,
>
> Thomas Ginter
> 801-448-7676
> thomas.gin...@utah.edu
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Programatically create RDDs based on input

2015-10-31 Thread ayan guha
Yes, this can be done. quick python equivalent:

# In Driver
fileList=["/file1.txt","/file2.txt"]
rdd = []
for f in fileList:
 rdd = jsc.textFile(f)
 rdds.append(rdd)



On Sat, Oct 31, 2015 at 11:09 PM, amit tewari 
wrote:

> Hi
>
> I need the ability to be able to create RDDs programatically inside my
> program (e.g. based on varaible number of input files).
>
> Can this be done?
>
> I need this as I want to run the following statement inside an iteration:
>
> JavaRDD rdd1 = jsc.textFile("/file1.txt");
>
> Thanks
> Amit
>



-- 
Best Regards,
Ayan Guha


Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-31 Thread Michael Armbrust
This is a bug in DataFrame caching.  You can avoid caching or turn off
compression.  It is fixed in Spark 1.5.1
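
(If upgrading isn't an option yet, the compression Michael mentions can be switched off on the SQLContext; a minimal workaround sketch, assuming Spark 1.5.0:)

// Disable columnar compression for cached DataFrames (works around the 1.5.0 issue)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")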

On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> I don’t believe I have it on 1.5.1. Are you able to test the data locally
> to confirm, or is it too large?
>
> From: "Zhang, Jingyu" 
> Date: Friday, October 30, 2015 at 7:31 PM
> To: Silvio Fiorito 
> Cc: Ted Yu , user 
>
> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>
> Thanks Silvio and Ted,
>
> Can you please let me know how to fix this intermittent issues? Should I
> wait EMR upgrading to support the Spark 1.5.1 or change my code from
> DataFrame to normal Spark map-reduce?
>
> Regards,
>
> Jingyu
>
> On 31 October 2015 at 09:40, Silvio Fiorito  > wrote:
>
>> It's something due to the columnar compression. I've seen similar
>> intermittent issues when caching DataFrames. "sportingpulse.com" is a
>> value in one of the columns of the DF.
>> --
>> From: Ted Yu 
>> Sent: ‎10/‎30/‎2015 6:33 PM
>> To: Zhang, Jingyu 
>> Cc: user 
>> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>>
>> I searched for sportingpulse in *.scala and *.java files under 1.5
>> branch.
>> There was no hit.
>>
>> mvn dependency doesn't show sportingpulse either.
>>
>> Is it possible this is specific to EMR ?
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu 
>> wrote:
>>
>>> There is not a problem in Spark SQL 1.5.1 but the error of "key not
>>> found: sportingpulse.com" shown up when I use 1.5.0.
>>>
>>> I have to use the version of 1.5.0 because that the one AWS EMR
>>> support.  Can anyone tell me why Spark uses "sportingpulse.com" and how
>>> to fix it?
>>>
>>> Thanks.
>>>
>>> Caused by: java.util.NoSuchElementException: key not found:
>>> sportingpulse.com
>>>
>>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>>
>>> at scala.collection.AbstractMap.default(Map.scala:58)
>>>
>>> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>>>
>>> at
>>> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(
>>> compressionSchemes.scala:258)
>>>
>>> at
>>> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(
>>> CompressibleColumnBuilder.scala:110)
>>>
>>> at org.apache.spark.sql.columnar.NativeColumnBuilder.build(
>>> ColumnBuilder.scala:87)
>>>
>>> at
>>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
>>> InMemoryColumnarTableScan.scala:152)
>>>
>>> at
>>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
>>> InMemoryColumnarTableScan.scala:152)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> TraversableLike.scala:244)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> TraversableLike.scala:244)
>>>
>>> at scala.collection.IndexedSeqOptimized$class.foreach(
>>> IndexedSeqOptimized.scala:33)
>>>
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>
>>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>>
>>> at
>>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>>> InMemoryColumnarTableScan.scala:152)
>>>
>>> at
>>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>>> InMemoryColumnarTableScan.scala:120)
>>>
>>> at org.apache.spark.storage.MemoryStore.unrollSafely(
>>> MemoryStore.scala:278)
>>>
>>> at org.apache.spark.CacheManager.putInBlockManager(
>>> CacheManager.scala:171)
>>>
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>>
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>>> MapPartitionsRDD.scala:38)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>>> MapPartitionsRDD.scala:38)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>>> MapPartitionsRDD.scala:38)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(
>>> MapPartitionsWithPreparationRDD.scala:63)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at 

Re: Pulling data from a secured SQL database

2015-10-31 Thread Deenar Toraskar
Thomas

I have the same problem, though in my case getting Kerberos authentication
to MSSQLServer from the cluster nodes does not seem to be supported. There
are a couple of options that come to mind.

1) You can pull the data running sqoop in local mode on the smaller
development machines and write to HDFS or to a persistent store connected
to your Spark cluster.
2) You can run Spark in local mode on the smaller development machines and
use JDBC Data Source and do something similar.

Regards
Deenar

*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk
07714140812




On 31 October 2015 at 11:35, Michael Armbrust 
wrote:

> I would try using the JDBC Data Source
> 
> and save the data to parquet
> .
> You can then put that data on your Spark cluster (probably install HDFS).
>
> On Fri, Oct 30, 2015 at 6:49 PM, Thomas Ginter 
> wrote:
>
>> I am working in an environment where data is stored in MS SQL Server.  It
>> has been secured so that only a specific set of machines can access the
>> database through an integrated security Microsoft JDBC connection.  We also
>> have a couple of beefy linux machines we can use to host a Spark cluster
>> but those machines do not have access to the databases directly.  How can I
>> pull the data from the SQL database on the smaller development machine and
>> then have it distribute to the Spark cluster for processing?  Can the
>> driver pull data and then distribute execution?
>>
>> Thanks,
>>
>> Thomas Ginter
>> 801-448-7676
>> thomas.gin...@utah.edu
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Assign unique link ID

2015-10-31 Thread Sarath Chandra
Thanks for the reply Ayan.

I had this idea earlier, but the problem is that the number of columns used for
joining varies depending on some data conditions, and their data types differ
as well. So I'm not sure how to define the UDF, since we need to specify the
argument count and types upfront.

Any ideas how to tackle this?

Regards,
Sarath.

On Sat, Oct 31, 2015 at 4:37 PM, ayan guha  wrote:

> Can this be a solution?
>
> 1. Write a function which will take a string and convert to md5 hash
> 2. From your base table, generate a string out of all columns you have
> used for joining. So, records 1 and 4 should generate same hash value.
> 3. group by using this new id (you have already linked the records) and
> pull out required fields.
>
> Please let the group know if it works...
>
> Best
> Ayan
>
> On Sat, Oct 31, 2015 at 6:44 PM, Sarath Chandra <
> sarathchandra.jos...@algofusiontech.com> wrote:
>
>> Hi All,
>>
>> I have a hive table where data from 2 different sources (S1 and S2) get
>> accumulated. Sample data below -
>>
>>
>> *RECORD_ID|SOURCE_TYPE|TRN_NO|DATE1|DATE2|BRANCH|REF1|REF2|REF3|REF4|REF5|REF6|DC_FLAG|AMOUNT|CURRENCY*
>>
>> *1|S1|55|19-Oct-2015|19-Oct-2015|25602|999||41106|47311|379|9|004|999||Cr|2672.00|INR*
>>
>> *2|S1|55|19-Oct-2015|19-Oct-2015|81201|999||41106|9|379|9|004|999||Dr|2672.00|INR*
>>
>> *3|S2|55|19-OCT-2015|19-OCT-2015|81201|999||41106|9|379|9|004|999||DR|2672|INR*
>>
>> *4|S2|55|19-OCT-2015|19-OCT-2015|25602|999||41106|47311|379|9|004|999||CR|2672|INR*
>>
>> I have a requirement to link similar records (same dates, branch and
>> reference numbers) source wise and assign them unique ID linking the 2
>> records. For example records 1 and 4 above should be linked with same ID.
>>
>> I've written code below to segregate data source wise and join them based
>> on the similarities. But not knowing how to proceed further.
>>
>> *var hc = new org.apache.spark.sql.hive.HiveContext(sc);*
>> *var src = hc.sql("select
>> RECORD_ID,SOURCE,TRN_NO,DATE1,DATE2,BRANCH,REF1,REF2,REF3,REF4,REF5,REF6,DC_FLAG,AMOUNT,CURRENCY
>> from src_table");*
>>
>> *var s1 = src.filter("source_type='S1'");*
>>
>> *var s2 = src.filter("source_type='S2'");*
>> *var src_join = s1.as ("S1").join(s2.as
>> ("S2")).filter("(S1.TRN_NO= S2.TRN_NO) and (S1.DATE1=
>> S2.DATE1) and (S1.DATE2= S2.DATE2) and (S1.BRANCH= S2.BRANCH) and (S1.REF1=
>> S2.REF1) and (S1.REF2= S2.REF2) and (S1.REF3= S2.REF3) and (S1.REF4=
>> S2.REF4) and (S1.REF5= S2.REF5) and (S1.REF6= S2.REF6) and (S1.CURRENCY=
>> S2.CURRENCY)");*
>>
>> Tried using a UDF which returns a random value or hashed string using
>> record IDs of both sides and include it to schema using withColumn, but
>> ended up getting duplicate link IDs.
>>
>> Also when I use a UDF I'm not able to refer to the columns using the
>> alias in next steps. For example if I create a new DF using below line -
>> *var src_link = src_join.as
>> ("SJ").withColumn("LINK_ID",
>> linkIDUDF(src_join("S1.RECORD_ID"),src("S2.RECORD_ID")));*
>> Then in further lines I'm not able to refer to "s1" columns from
>> "src_link" like -
>> *var src_link_s1 = src_link.as
>> ("SL").select($"S1.RECORD_ID");*
>>
>> Please guide me.
>>
>> Regards,
>> Sarath.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Assign unique link ID

2015-10-31 Thread ayan guha
Hi

The way I see it, your dedup condition needs to be defined. If it is variable,
then the joining approach is no good either. You may want to stub columns (for
example, putting a default value in the joining clause) to achieve this, as in
the sketch below. If not, could you state the problem with all the other
conditions so we can discuss further?

Getting a partition key upfront will be important in your case to control the
shuffle.

Best
Ayan
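
(One way to sidestep the fixed-arity UDF problem without a UDF at all: build the column list programmatically and stub nulls with a default via coalesce — a rough sketch against the `src` DataFrame from the earlier post, assuming Spark 1.5+; the column list is illustrative:)

import org.apache.spark.sql.functions.{coalesce, concat_ws, lit, md5}

// keyCols can differ per data condition; everything is cast to string before hashing
val keyCols: Seq[String] = Seq("TRN_NO", "DATE1", "BRANCH", "REF1")

val hashInput = keyCols.map(c => coalesce(src(c).cast("string"), lit("")))
val linked = src.withColumn("LINK_ID", md5(concat_ws("|", hashInput: _*)))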

On Sat, Oct 31, 2015 at 11:54 PM, Sarath Chandra <
sarathchandra.jos...@algofusiontech.com> wrote:

> Thanks for the reply Ayan.
>
> I got this idea earlier but the problem is the number of columns used for
> joining will be varying depending on the some data conditions. Also their
> data types will be different. So I'm not getting how to define the UDF as
> we need to upfront specify the argument count and their types.
>
> Any ideas how to tackle this?
>
> Regards,
> Sarath.
>
> On Sat, Oct 31, 2015 at 4:37 PM, ayan guha  wrote:
>
>> Can this be a solution?
>>
>> 1. Write a function which will take a string and convert to md5 hash
>> 2. From your base table, generate a string out of all columns you have
>> used for joining. So, records 1 and 4 should generate same hash value.
>> 3. group by using this new id (you have already linked the records) and
>> pull out required fields.
>>
>> Please let the group know if it works...
>>
>> Best
>> Ayan
>>
>> On Sat, Oct 31, 2015 at 6:44 PM, Sarath Chandra <
>> sarathchandra.jos...@algofusiontech.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a hive table where data from 2 different sources (S1 and S2) get
>>> accumulated. Sample data below -
>>>
>>>
>>> *RECORD_ID|SOURCE_TYPE|TRN_NO|DATE1|DATE2|BRANCH|REF1|REF2|REF3|REF4|REF5|REF6|DC_FLAG|AMOUNT|CURRENCY*
>>>
>>> *1|S1|55|19-Oct-2015|19-Oct-2015|25602|999||41106|47311|379|9|004|999||Cr|2672.00|INR*
>>>
>>> *2|S1|55|19-Oct-2015|19-Oct-2015|81201|999||41106|9|379|9|004|999||Dr|2672.00|INR*
>>>
>>> *3|S2|55|19-OCT-2015|19-OCT-2015|81201|999||41106|9|379|9|004|999||DR|2672|INR*
>>>
>>> *4|S2|55|19-OCT-2015|19-OCT-2015|25602|999||41106|47311|379|9|004|999||CR|2672|INR*
>>>
>>> I have a requirement to link similar records (same dates, branch and
>>> reference numbers) source wise and assign them unique ID linking the 2
>>> records. For example records 1 and 4 above should be linked with same ID.
>>>
>>> I've written code below to segregate data source wise and join them
>>> based on the similarities. But not knowing how to proceed further.
>>>
>>> *var hc = new org.apache.spark.sql.hive.HiveContext(sc);*
>>> *var src = hc.sql("select
>>> RECORD_ID,SOURCE,TRN_NO,DATE1,DATE2,BRANCH,REF1,REF2,REF3,REF4,REF5,REF6,DC_FLAG,AMOUNT,CURRENCY
>>> from src_table");*
>>>
>>> *var s1 = src.filter("source_type='S1'");*
>>>
>>> *var s2 = src.filter("source_type='S2'");*
>>> *var src_join = s1.as ("S1").join(s2.as
>>> ("S2")).filter("(S1.TRN_NO= S2.TRN_NO) and (S1.DATE1=
>>> S2.DATE1) and (S1.DATE2= S2.DATE2) and (S1.BRANCH= S2.BRANCH) and (S1.REF1=
>>> S2.REF1) and (S1.REF2= S2.REF2) and (S1.REF3= S2.REF3) and (S1.REF4=
>>> S2.REF4) and (S1.REF5= S2.REF5) and (S1.REF6= S2.REF6) and (S1.CURRENCY=
>>> S2.CURRENCY)");*
>>>
>>> Tried using a UDF which returns a random value or hashed string using
>>> record IDs of both sides and include it to schema using withColumn, but
>>> ended up getting duplicate link IDs.
>>>
>>> Also when I use a UDF I'm not able to refer to the columns using the
>>> alias in next steps. For example if I create a new DF using below line -
>>> *var src_link = src_join.as
>>> ("SJ").withColumn("LINK_ID",
>>> linkIDUDF(src_join("S1.RECORD_ID"),src("S2.RECORD_ID")));*
>>> Then in further lines I'm not able to refer to "s1" columns from
>>> "src_link" like -
>>> *var src_link_s1 = src_link.as
>>> ("SL").select($"S1.RECORD_ID");*
>>>
>>> Please guide me.
>>>
>>> Regards,
>>> Sarath.
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: Spark tunning increase number of active tasks

2015-10-31 Thread Sandy Ryza
Hi Xiaochuan,

The most likely cause of the "Lost container" issue is that YARN is killing
containers for exceeding memory limits.  If this is the case, you should be
able to find instances of "exceeding memory limits" in the application
logs.

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
has a more detailed explanation of why this happens.

-Sandy
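
(If the logs do show that message, the usual knob on YARN in Spark 1.2-1.4 is the executor memory overhead; a hedged example of raising it at submit time — all values below are placeholders:)

# values are placeholders; memoryOverhead is in MB
spark-submit \
  --master yarn-client \
  --num-executors 100 \
  --executor-cores 2 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_pyspark_job.py

Also note that the dynamic allocation setting is spelled spark.dynamicAllocation.enabled (with the spark. prefix) and needs the external shuffle service (spark.shuffle.service.enabled=true) to be of any use.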

On Sat, Oct 31, 2015 at 4:29 AM, Jörn Franke  wrote:

> Maybe Hortonworks support can help you much better.
>
> Otherwise you may want to change the yarn scheduler configuration and
> preemption. Do you use something like speculative execution?
>
> How do you start execution of the programs? Maybe you are already using
> all cores of the master...
>
> On 30 Oct 2015, at 23:32, YI, XIAOCHUAN  wrote:
>
> Hi
>
> Our team has a 40 node hortonworks Hadoop cluster 2.2.4.2-2  (36 data
> node) with apache spark 1.2 and 1.4 installed.
>
> Each node has 64G RAM and 8 cores.
>
>
>
> We are only able to use <= 72 executors with executor-cores=2
>
> So we are only get 144 active tasks running pyspark programs with pyspark.
>
> [Stage 1:===>(596 + 144) /
> 2042]
>
> IF we use larger number for --num-executors, the pyspark program exit with
> errors:
>
> ERROR YarnScheduler: Lost executor 113 on hag017.example.com: remote Rpc
> client disassociated
>
>
>
> I tried spark 1.4 and conf.set("dynamicAllocation.enabled", "true").
> However it does not help us to increase the number of active tasks.
>
> I expect larger number of active tasks with the cluster we have.
>
> Could anyone advise on this? Thank you very much!
>
>
>
> Shaun
>
>
>
>


Re: Programatically create RDDs based on input

2015-10-31 Thread amit tewari
Thanks Ayan, that's something similar to what I am looking at, but trying the
same in Java gives a compile error:

JavaRDD jRDD[] = new JavaRDD[3];

//Error: Cannot create a generic array of JavaRDD

Thanks
Amit



On Sat, Oct 31, 2015 at 5:46 PM, ayan guha  wrote:

> Corrected a typo...
>
> # In Driver
> fileList=["/file1.txt","/file2.txt"]
> rdds = []
> for f in fileList:
>  rdd = jsc.textFile(f)
>  rdds.append(rdd)
>
>
> On Sat, Oct 31, 2015 at 11:14 PM, ayan guha  wrote:
>
>> Yes, this can be done. quick python equivalent:
>>
>> # In Driver
>> fileList=["/file1.txt","/file2.txt"]
>> rdd = []
>> for f in fileList:
>>  rdd = jsc.textFile(f)
>>  rdds.append(rdd)
>>
>>
>>
>> On Sat, Oct 31, 2015 at 11:09 PM, amit tewari 
>> wrote:
>>
>>> Hi
>>>
>>> I need the ability to be able to create RDDs programatically inside my
>>> program (e.g. based on varaible number of input files).
>>>
>>> Can this be done?
>>>
>>> I need this as I want to run the following statement inside an iteration:
>>>
>>> JavaRDD rdd1 = jsc.textFile("/file1.txt");
>>>
>>> Thanks
>>> Amit
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Having written a post last Tuesday, I'm still not able to see my post
on Nabble. And yeah, my subscription to u...@apache.spark.org was
successful (rechecked a minute ago).

Even more, I have no way (and no confirmation) that my post was accepted,
rejected, whatever.

This is very L4M3 and so 80ies.

Any help appreciated. Thx!


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Thanks Nicholas for clarifying.

That said, it's not about blaming but about improving.

The fact that my post from Tuesday is not visible on Nabble and that I
received no answer lets me doubt it got posted correctly. On the other hand,
you can read my recent post. Just irritated.

Hope to see things improved ...

Cheers, Martin
Am 31.10.2015 17:34 schrieb "Nicholas Chammas" :

> Nabble is an unofficial archive of this mailing list. I don't know who
> runs it, but it's not Apache. There are often delays between when things
> get posted to the list and updated on Nabble, and sometimes things never
> make it over for whatever reason.
>
> This mailing list is, I agree, very 1980s. Unfortunately, it's required by
> the Apache Software Foundation (ASF).
>
> There was a discussion earlier this year
> 
>  about
> migrating to Discourse that explained why we're stuck with what we have for
> now. Ironically, that discussion is hard to follow on the Apache archives
> (which is precisely one of the motivations for proposing to migrate to
> Discourse), but there is a more readable archive on another unofficial
> site
> 
> .
>
> Nick
>
> On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
> wrote:
>
>> Having written a post on last Tuesday, I'm still not able to see my post
>> under nabble. And yeah, subscription to u...@apache.spark.org was
>> successful (rechecked a minute ago)
>>
>> Even more, I have no way (and no confirmation) that my post was accepted,
>> rejected, whatever.
>>
>> This is very L4M3 and so 80ies.
>>
>> Any help appreciated. Thx!
>>
>


Re: Programatically create RDDs based on input

2015-10-31 Thread ayan guha
My Java knowledge is limited, but you may try a HashMap and put the RDDs
in it?

On Sun, Nov 1, 2015 at 4:34 AM, amit tewari  wrote:

> Thanks Ayan thats something similar to what I am looking at but trying the
> same in Java is giving compile error:
>
> JavaRDD jRDD[] = new JavaRDD[3];
>
> //Error: Cannot create a generic array of JavaRDD
>
> Thanks
> Amit
>
>
>
> On Sat, Oct 31, 2015 at 5:46 PM, ayan guha  wrote:
>
>> Corrected a typo...
>>
>> # In Driver
>> fileList=["/file1.txt","/file2.txt"]
>> rdds = []
>> for f in fileList:
>>  rdd = jsc.textFile(f)
>>  rdds.append(rdd)
>>
>>
>> On Sat, Oct 31, 2015 at 11:14 PM, ayan guha  wrote:
>>
>>> Yes, this can be done. quick python equivalent:
>>>
>>> # In Driver
>>> fileList=["/file1.txt","/file2.txt"]
>>> rdd = []
>>> for f in fileList:
>>>  rdd = jsc.textFile(f)
>>>  rdds.append(rdd)
>>>
>>>
>>>
>>> On Sat, Oct 31, 2015 at 11:09 PM, amit tewari 
>>> wrote:
>>>
 Hi

 I need the ability to be able to create RDDs programatically inside my
 program (e.g. based on varaible number of input files).

 Can this be done?

 I need this as I want to run the following statement inside an
 iteration:

 JavaRDD rdd1 = jsc.textFile("/file1.txt");

 Thanks
 Amit

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re:Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread 李森栋
spark 1.4.1
hadoop 2.6.0
centos 6.6






At 2015-10-31 23:14:46, "Ted Yu"  wrote:

Which Spark release are you using ?


Which OS ?


Thanks


On Sat, Oct 31, 2015 at 5:18 AM, hotdog  wrote:
I meet a situation:
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count()  // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey().count()
it 's so fast

but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()
it is so slow
and there are many such log in my executors:
15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling
in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling
in-memory map of 633.1 MB to disk (9 times so far)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/job-hangs-when-using-pipe-with-reduceByKey-tp25242.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread Ted Yu
Which Spark release are you using ?

Which OS ?

Thanks

On Sat, Oct 31, 2015 at 5:18 AM, hotdog  wrote:

> I meet a situation:
> When I use
> val a = rdd.pipe("./my_cpp_program").persist()
> a.count()  // just use it to persist a
> val b = a.map(s => (s, 1)).reduceByKey().count()
> it 's so fast
>
> but when I use
> val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()
> it is so slow
> and there are many such log in my executors:
> 15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling
> in-memory map of 633.1 MB to disk (8 times so far)
> 15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling
> in-memory map of 633.1 MB to disk (8 times so far)
> 15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling
> in-memory map of 633.1 MB to disk (8 times so far)
> 15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling
> in-memory map of 633.1 MB to disk (8 times so far)
> 15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling
> in-memory map of 633.1 MB to disk (9 times so far)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/job-hangs-when-using-pipe-with-reduceByKey-tp25242.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Nicholas Chammas
Nabble is an unofficial archive of this mailing list. I don't know who runs
it, but it's not Apache. There are often delays between when things get
posted to the list and updated on Nabble, and sometimes things never make
it over for whatever reason.

This mailing list is, I agree, very 1980s. Unfortunately, it's required by
the Apache Software Foundation (ASF).

There was a discussion earlier this year about migrating to Discourse that
explained why we're stuck with what we have for now. Ironically, that
discussion is hard to follow on the Apache archives (which is precisely one of
the motivations for proposing to migrate to Discourse), but there is a more
readable archive on another unofficial site.

Nick

On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
wrote:

> Having written a post on last Tuesday, I'm still not able to see my post
> under nabble. And yeah, subscription to u...@apache.spark.org was
> successful (rechecked a minute ago)
>
> Even more, I have no way (and no confirmation) that my post was accepted,
> rejected, whatever.
>
> This is very L4M3 and so 80ies.
>
> Any help appreciated. Thx!
>


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Ted Yu
From the result of http://search-hadoop.com/?q=spark+Martin+Senne ,
Martin's post from Tuesday didn't go through.

FYI

On Sat, Oct 31, 2015 at 9:34 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Nabble is an unofficial archive of this mailing list. I don't know who
> runs it, but it's not Apache. There are often delays between when things
> get posted to the list and updated on Nabble, and sometimes things never
> make it over for whatever reason.
>
> This mailing list is, I agree, very 1980s. Unfortunately, it's required by
> the Apache Software Foundation (ASF).
>
> There was a discussion earlier this year
> 
>  about
> migrating to Discourse that explained why we're stuck with what we have for
> now. Ironically, that discussion is hard to follow on the Apache archives
> (which is precisely one of the motivations for proposing to migrate to
> Discourse), but there is a more readable archive on another unofficial
> site
> 
> .
>
> Nick
>
> On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
> wrote:
>
>> Having written a post on last Tuesday, I'm still not able to see my post
>> under nabble. And yeah, subscription to u...@apache.spark.org was
>> successful (rechecked a minute ago)
>>
>> Even more, I have no way (and no confirmation) that my post was accepted,
>> rejected, whatever.
>>
>> This is very L4M3 and so 80ies.
>>
>> Any help appreciated. Thx!
>>
>


How to lookup by a key in an RDD

2015-10-31 Thread swetha
Hi,

I have a requirement wherein I have to load data from HDFS, build an RDD,
look up values by key to make some updates, and then save the result back to
HDFS. How do I look up a value by key in an RDD?


Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-lookup-by-a-key-in-an-RDD-tp25243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to lookup by a key in an RDD

2015-10-31 Thread Natu Lauchande
Hi,

Looking at the lookup function here might help you:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Natu
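
(A rough sketch of the load / look up / update / save cycle, assuming the HDFS files hold tab-separated key/value lines; lookup() returns every value stored under the key and is cheapest when the RDD is already partitioned by key:)

import org.apache.spark.HashPartitioner

// Build a pair RDD keyed on the first field and keep it partitioned by key
val pairs = sc.textFile("hdfs:///input/path")
  .map(_.split("\t"))
  .map(a => (a(0), a(1)))
  .partitionBy(new HashPartitioner(100))
  .cache()

// All values stored under that key
val values: Seq[String] = pairs.lookup("someKey")

// Updates are expressed as a transformation over the whole RDD, then saved back
pairs.mapValues(v => v.toUpperCase)
  .map { case (k, v) => s"$k\t$v" }
  .saveAsTextFile("hdfs:///output/path")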

On Sat, Oct 31, 2015 at 6:04 PM, swetha  wrote:

> Hi,
>
> I have a requirement wherein I have to load data from hdfs, build an RDD
> and
> then lookup by key to do some updates to the value and then save it back to
> hdfs. How to lookup for a value using a key in an RDD?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-lookup-by-a-key-in-an-RDD-tp25243.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Ted, thx. Should I repost?
Am 31.10.2015 17:41 schrieb "Ted Yu" :

> From the result of http://search-hadoop.com/?q=spark+Martin+Senne ,
> Martin's post Tuesday didn't go through.
>
> FYI
>
> On Sat, Oct 31, 2015 at 9:34 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Nabble is an unofficial archive of this mailing list. I don't know who
>> runs it, but it's not Apache. There are often delays between when things
>> get posted to the list and updated on Nabble, and sometimes things never
>> make it over for whatever reason.
>>
>> This mailing list is, I agree, very 1980s. Unfortunately, it's required
>> by the Apache Software Foundation (ASF).
>>
>> There was a discussion earlier this year
>> 
>>  about
>> migrating to Discourse that explained why we're stuck with what we have for
>> now. Ironically, that discussion is hard to follow on the Apache archives
>> (which is precisely one of the motivations for proposing to migrate to
>> Discourse), but there is a more readable archive on another unofficial
>> site
>> 
>> .
>>
>> Nick
>>
>> On Sat, Oct 31, 2015 at 12:20 PM Martin Senne <
>> martin.se...@googlemail.com> wrote:
>>
>>> Having written a post on last Tuesday, I'm still not able to see my post
>>> under nabble. And yeah, subscription to u...@apache.spark.org was
>>> successful (rechecked a minute ago)
>>>
>>> Even more, I have no way (and no confirmation) that my post was
>>> accepted, rejected, whatever.
>>>
>>> This is very L4M3 and so 80ies.
>>>
>>> Any help appreciated. Thx!
>>>
>>
>


Why does predicate pushdown not work on HiveContext (concrete HiveThriftServer2) ?

2015-10-31 Thread Martin Senne
Hi all,

# Program Sketch

1. I create a HiveContext `hiveContext`.
2. With that context, I create a DataFrame `df` from a JDBC relational table.
3. I register the DataFrame `df` via df.registerTempTable("TESTTABLE").
4. I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext).
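
(Roughly, in code — a sketch only, assuming Spark 1.5 with the hive-thriftserver module on the classpath; the JDBC URL and table name are placeholders:)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)

// DataFrame over the relational table via the JDBC data source (placeholder URL/table)
val df = hiveContext.read.format("jdbc")
  .option("url", "jdbc:h2:tcp://localhost/~/test")
  .option("dbtable", "TEST")
  .load()

df.registerTempTable("TESTTABLE")

// Expose the temp table to external clients such as Beeline
HiveThriftServer2.startWithContext(hiveContext)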

The TESTTABLE contains 1,000,000 entries; the columns are ID (INT) and NAME
(VARCHAR):

+-----+--------+
| ID  | NAME   |
+-----+--------+
| 1   | Hello  |
| 2   | Hello  |
| 3   | Hello  |
| ... | ...    |
+-----+--------+

With Beeline I access the SQL Endpoint (at port 1) of the
HiveThriftServer and perform a query. E.g.

SELECT * FROM TESTTABLE WHERE ID='3'

When I inspect the query log of the DB for the SQL statements executed, I see

/*SQL #:100 t:657*/  SELECT \"ID\",\"NAME\" FROM test;

So no predicate pushdown happens, as the WHERE clause is missing.

# Questions

This gives rise to the following questions:

1. Why is no predicate pushdown performed?
2. Can this be changed by not using registerTempTable? If so, how?
3. Or is this a known restriction of the HiveThriftServer?

# Counterexample

If I create a DataFrame `df` in a Spark SQLContext and call

df.filter(df("ID") === 3).show()

I observe

/*SQL #:1*/SELECT \"ID\",\"NAME\" FROM test WHERE ID = 3;

as expected.