How to load partial data from HDFS using Spark SQL

2016-01-01 Thread SRK
Hi,

How can I load partial data from HDFS using Spark SQL? Suppose I want to load
data based on a filter like "Select * from table where id = " using Spark SQL
with DataFrames; how can that be done? The idea here is that I do not want to
load the whole dataset into memory when I run the SQL; I just want to load the
data that matches the filter.


Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-partial-data-from-HDFS-using-Spark-SQL-tp25855.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to load partial data from HDFS using Spark SQL

2016-01-01 Thread UMESH CHAUDHARY
OK, so what's wrong with using:

val df = hiveContext.sql("Select * from table where id = ") // filtered DataFrame
df.count
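
Because DataFrame operations are lazy, the filter is applied before any action
materializes data, and with a columnar source such as Parquet the predicate can
be pushed down so much less data needs to be read. A rough equivalent with the
DataFrame API (a sketch; the table name, path, and id value are placeholders):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.table("table")           // or hiveContext.read.parquet("/path/to/data")
val filtered = df.filter(df("id") === 12345)  // placeholder id value
filtered.count()                              // data is only scanned when an action runs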

On Sat, Jan 2, 2016 at 11:56 AM, SRK  wrote:

> Hi,
>
> How can I load partial data from HDFS using Spark SQL? Suppose I want to load
> data based on a filter like "Select * from table where id = " using Spark SQL
> with DataFrames; how can that be done? The idea here is that I do not want to
> load the whole dataset into memory when I run the SQL; I just want to load the
> data that matches the filter.
>
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-partial-data-from-HDFS-using-Spark-SQL-tp25855.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SparkSQL][Parquet] Read from nested parquet data

2016-01-01 Thread lin
Hi Cheng,

Thank you for your informative explanation; it is quite helpful.
We'd like to try both approaches; should we have some progress, we would
update this thread so that anybody interested can follow.

Thanks again @yanboliang, @chenglian!


Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Umesh Kacha
Hi, thanks, I did that and I have attached the thread dump images. That was
the intention of my question: asking for help to identify which waiting thread
is the culprit.

Regards,
Umesh

On Sat, Jan 2, 2016 at 8:38 AM, Prabhu Joseph 
wrote:

> Take a thread dump of the executor process several times over a short time
> period and check what each thread is doing at the different times; this will
> help identify the expensive sections in the user code.
>
> Thanks,
> Prabhu Joseph
>
> On Sat, Jan 2, 2016 at 3:28 AM, unk1102  wrote:
>
>> Sorry please see attached waiting thread log
>>
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
>> >
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
>> >
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850p25851.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Cannot get repartitioning to work

2016-01-01 Thread Jeff Zhang
You are using the wrong RDD; use the returned RDD, as follows:

val repartitionedRDD = results.repartition(20)
println(repartitionedRDD.partitions.size)
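
For completeness, a sketch of the corrected flow from the original snippet
(assuming input is a pair RDD, as in the original post); repartition is a
transformation that returns a new RDD and leaves the original one unchanged:

import org.apache.spark.storage.StorageLevel

val results = input.reduceByKey((x, y) => x + y, 10).persist(StorageLevel.DISK_ONLY)

val repartitioned = results.repartition(20)   // returns a NEW RDD; `results` keeps 10 partitions
println(results.partitions.size)              // 10
println(repartitioned.partitions.size)        // 20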

On Sat, Jan 2, 2016 at 10:38 AM, jimitkr  wrote:

> Hi,
>
> I'm trying to test some custom parallelism and repartitioning in Spark.
>
> First, I reduce my RDD (forcing creation of 10 partitions for the same).
>
> I then repartition the data to 20 partitions and print out the number of
> partitions, but I always get 10. It looks like the repartition command is
> getting ignored.
>
> How do I get repartitioning to work? See the code below:
>
> val results = input.reduceByKey((x, y) => x + y, 10).persist(StorageLevel.DISK_ONLY)
> results.repartition(20)
> println(results.partitions.size)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-get-repartitioning-to-work-tp25852.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Unable to read JSON input in Spark (YARN Cluster)

2016-01-01 Thread ๏̯͡๏
Version: Spark 1.5.2

*Spark built with Hive*
git clone git://github.com/apache/spark.git
./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
-Phive -Phive-thriftserver


*Input:*
-sh-4.1$ hadoop fs -du -h /user/dvasthimal/poc_success_spark/data/input
2.5 G  /user/dvasthimal/poc_success_spark/data/input/dw_bid_1231.seq
2.5 G
 /user/dvasthimal/poc_success_spark/data/input/dw_mao_item_best_offr_1231.seq
*5.9 G
 /user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json*
-sh-4.1$

*Spark Shell:*

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apache/hadoop/lib/native/
export SPARK_HOME=*/home/dvasthimal/spark-1.5.2-bin-2.4.0*
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
export HADOOP_CONF_DIR=/apache/hadoop/conf
cd $SPARK_HOME
export
SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-21.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-21.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/jackson-mapper-asl-1.8.5.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/experimentation-reporting-common-0.0.1-SNAPSHOT.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-2.4.1-EBAY-11.jar
./bin/spark-shell

import org.apache.hadoop.io.Text
import org.codehaus.jackson.map.ObjectMapper
import com.ebay.hadoop.platform.model.SessionContainer
import scala.collection.JavaConversions._
import  com.ebay.globalenv.sojourner.TrackingProperty
import  java.net.URLDecoder
import  com.ebay.ep.reporting.common.util.TagsUtil
import org.apache.hadoop.conf.Configuration
import sqlContext.implicits._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df =
sqlContext.read.json("/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json")
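
As an aside, a sketch of reading the same file with an explicit schema (the
field names below are placeholders, not the real session schema). Passing a
schema avoids the schema-inference pass over the whole 5.9 GB file, which is
the pass that appears to be running when the OutOfMemoryError below occurs,
although very large individual JSON records can still be expensive to parse:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Placeholder schema: replace with the actual fields of expt_session_1231.json
val sessionSchema = StructType(Seq(
  StructField("guid", StringType),
  StructField("sessionSkey", LongType)
))

val dfWithSchema = sqlContext.read
  .schema(sessionSchema)
  .json("/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json")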


*Errors:*
1.

16/01/01 18:36:12 INFO json.JSONRelation: Listing
hdfs://apollo-phx-nn-ha/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json
on driver
16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(268744) called
with curMem=0, maxMem=556038881
16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 262.4 KB, free 530.0 MB)
16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(24028) called
with curMem=268744, maxMem=556038881
16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 23.5 KB, free 530.0 MB)
16/01/01 18:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:59605 (size: 23.5 KB, free: 530.3 MB)
16/01/01 18:36:12 INFO spark.SparkContext: Created broadcast 0 from json at <console>:36
16/01/01 18:36:12 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)


2.
16/01/01 18:36:44 INFO executor.Executor: Finished task 3.0 in stage 0.0
(TID 3). 2256 bytes result sent to driver
16/01/01 18:36:44 INFO scheduler.TaskSetManager: Finished task 3.0 in stage
0.0 (TID 3) in 32082 ms on localhost (2/24)
16/01/01 18:36:54 ERROR executor.Executor: Exception in task 9.0 in stage
0.0 (TID 9)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)

3.
TypeRef(TypeSymbol(class $read extends Serializable))

uncaught exception during compilation: java.lang.AssertionError
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
in stage 0.0 failed 1 times, most recent failure: Lost task 9.0 in stage
0.0 (TID 9, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
4.
16/01/01 18:36:56 ERROR util.Utils: Uncaught exception in thread Executor
task launch worker-19
java.lang.NullPointerException
at
org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:94)
at 

Re: How to save only values via saveAsHadoopFile or saveAsNewAPIHadoopFile

2016-01-01 Thread jimitkr
Doesn't this work? pair.values.saveAsHadoopFile()
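
Note that saveAsHadoopFile comes from PairRDDFunctions, so .values (a plain
RDD) does not expose it directly. A sketch of one way to write only the values
(the input RDD `pair`, its types, and the output path are placeholders); with a
NullWritable key, TextOutputFormat writes the value alone:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// pair: RDD[(String, String)] -- placeholder for the original key/value RDD
val valuesOnly = pair.map { case (_, v) => (NullWritable.get(), new Text(v)) }
valuesOnly.saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]]("/output/path")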



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-only-values-via-saveAsHadoopFile-or-saveAsNewAPIHadoopFile-tp25828p25853.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Hi, I have a Spark job which hangs for around 7 hours or more, until it is
killed by Autosys because of a timeout. The data is not huge. I am sure it gets
stuck because of GC, but I can't find the source code which causes the GC. I am
reusing almost all variables and trying to minimize creating local objects,
though I can't avoid creating many String objects in order to update DataFrame
values. When I look at the live thread debug in the executor where the job is
running, I see the attached running/waiting threads. Please guide me in finding
which waiting thread is the culprit preventing my job from finishing. My code
uses dataframe.groupBy on around 8 fields and also uses coalesce(1) twice, so
it shuffles huge amounts of data (GBs per executor) according to the UI.

[screenshots of the running/waiting threads attached]

Here is the heap space error, which I don't understand how to resolve in my
code:

[screenshot of the heap space error attached]

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Prabhu Joseph
Take a thread dump of the executor process several times over a short time
period and check what each thread is doing at the different times; this will
help identify the expensive sections in the user code.

Thanks,
Prabhu Joseph

On Sat, Jan 2, 2016 at 3:28 AM, unk1102  wrote:

> Sorry please see attached waiting thread log
>
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
> >
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
> >
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850p25851.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
Hi Kostiantyn,

You should be able to use spark.conf to specify the s3a keys.

I don't remember exactly, but you can add Hadoop properties by prefixing them
with spark.hadoop.*, where * is the s3a property name. For instance:

spark.hadoop.fs.s3a.access.key wudjgdueyhsj

Of course, you need to make sure the property key is right. I'm using my phone,
so I cannot easily verify.

Then you can specify a different user by using a different spark.conf passed
via --properties-file to spark-submit.
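
For illustration, the same prefix trick in code (a sketch; the key values are
placeholders). Anything set on the SparkConf as spark.hadoop.<name> is copied
into the Hadoop Configuration as <name>, so a per-user properties file with
these entries achieves the same thing:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder credentials; in practice they would come from the per-user
// properties file passed with --properties-file.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.access.key", "AKIA...")
  .set("spark.hadoop.fs.s3a.secret.key", "...")

val sc = new SparkContext(conf)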

HTH,

Jerry

Sent from my iPhone

> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Hi Jerry,
> 
> what you suggested looks to be working (I put hdfs-site.xml into 
> $SPARK_HOME/conf folder), but could you shed some light on how it can be 
> federated per user?
> Thanks in advance!
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
>> Hi Kostiantyn,
>> 
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
>> could define different spark-{user-x}.conf and source them during 
>> spark-submit. let us know if hdfs-site.xml works first. It should.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> Sent from my iPhone
>> 
>>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
>>> Hi Jerry,
>>> 
>>> I want to run different jobs on different S3 buckets - different AWS creds 
>>> - on the same instances. Could you shed some light if it's possible to 
>>> achieve with hdfs-site?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
 Hi Kostiantyn,
 
 Can you define those properties in hdfs-site.xml and make sure it is 
 visible in the class path when you spark-submit? It looks like a conf 
 sourcing issue to me. 
 
 Cheers,
 
 Sent from my iPhone
 
> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Chris,
> 
> thanks for the hint about IAM roles, but in my case I need to run
> different jobs with different S3 permissions on the same cluster, so this
> approach doesn't work for me, as far as I understand it
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>> couple things:
>> 
>> 1) switch to IAM roles if at all possible - explicitly passing AWS 
>> credentials is a long and lonely road in the end
>> 
>> 2) one really bad workaround/hack is to run a job that hits every worker 
>> and writes the credentials to the proper location (~/.awscredentials or 
>> whatever)
>> 
>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>> 
>> if you switch to IAM roles, things become a lot easier as you can 
>> authorize all of the EC2 instances in the cluster - and handles 
>> autoscaling very well - and at some point, you will want to autoscale.
>> 
>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> Chris,
>>> 
>>>  Good question. As you can see from the code, I set them up on the driver,
>>> so I expect they will be propagated to all nodes, won't they?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
 are the credentials visible from each Worker node to all the Executor 
 JVMs on each Worker?
 
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Dear Spark community,
> 
> I faced the following issue with trying accessing data on S3a, my 
> code is the following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret 
> key is null
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> 

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Sorry, please see the attached waiting thread log:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg>



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850p25851.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



frequent itemsets

2016-01-01 Thread Roberto Pagliari
When using the frequent itemsets APIs, I'm running into a StackOverflow
exception whenever there are too many combinations to deal with and/or too
many transactions and/or too many items.


Does anyone know how many transactions/items these APIs can deal with?


Thank you,



Cannot get repartitioning to work

2016-01-01 Thread jimitkr
Hi,

I'm trying to test some custom parallelism and repartitioning in Spark.

First, I reduce my RDD (forcing creation of 10 partitions for the same).

I then repartition the data to 20 partitions and print out the number of
partitions, but I always get 10. It looks like the repartition command is
getting ignored.

How do I get repartitioning to work? See the code below:

  val results = input.reduceByKey((x, y) => x + y, 10).persist(StorageLevel.DISK_ONLY)
  results.repartition(20)
  println(results.partitions.size)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-get-repartitioning-to-work-tp25852.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
Hi Jia,

I think the examples you provided are not very suitable to illustrate what the
driver and executors do, because they don't show the internal implementation
of the KMeans algorithm.
You can refer to the source code of MLlib KMeans (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L227
).
In short, the driver needs memory on the order of the number of cluster
centers, but each executor needs memory on the order of its partition size.
Usually we have a large dataset distributed across many executors, but the set
of centers is not very big, even compared with the dataset on a single executor.
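
To make the persist-then-train suggestion from earlier in this thread concrete,
a sketch in Scala (the path, parsing, and parameters are placeholders); the
executors hold the persisted partitions, while the driver only has to hold the
k cluster centers:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Placeholder input path; each line is assumed to be space-separated numbers.
val points = sc.textFile("hdfs:///path/to/points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Partitions that do not fit in memory spill to disk instead of being recomputed.
points.persist(StorageLevel.MEMORY_AND_DISK)

val k = 10
val maxIterations = 20
val model = KMeans.train(points, k, maxIterations)  // the k centers end up on the driver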

Cheers
Yanbo

2015-12-31 22:31 GMT+08:00 Jia Zou :

> Thanks, Yanbo.
> The results become much more reasonable, after I set driver memory to 5GB
> and increase worker memory to 25GB.
>
> So, my question is for following code snippet extracted from main method
> in JavaKMeans.java in examples, what will the driver do? and what will the
> worker do?
>
> I didn't understand this problem well by reading
> https://spark.apache.org/docs/1.1.0/cluster-overview.htmland
> http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark
>
> SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");
>
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
>
> JavaRDD lines = sc.textFile(inputFile);
>
> JavaRDD points = lines.map(new ParsePoint());
>
>  points.persist(StorageLevel.MEMORY_AND_DISK());
>
> KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs,
> KMeans.K_MEANS_PARALLEL());
>
>
> Thank you very much!
>
> Best Regards,
> Jia
>
> On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang  wrote:
>
>> Hi Jia,
>>
>> You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether
>> it can produce stable performance. The storage level of MEMORY_AND_DISK
>> will store the partitions that don't fit on disk and read them from there
>> when they are needed.
>> Actually, it's not necessary to set so large driver memory in your case,
>> because KMeans use low memory for driver if your k is not very large.
>>
>> Cheers
>> Yanbo
>>
>> 2015-12-30 22:20 GMT+08:00 Jia Zou :
>>
>>> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8
>>> CPU cores and 30GB memory. Executor memory is set to 15GB, and driver
>>> memory is set to 15GB.
>>>
>>> The observation is that, when input data size is smaller than 15GB, the
>>> performance is quite stable. However, when input data becomes larger than
>>> that, the performance will be extremely unpredictable. For example, for
>>> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
>>> dramatically different testing results: 27mins, 61mins and 114 mins. (All
>>> settings are the same for the 3 tests, and I will create input data
>>> immediately before running each of the tests to keep OS buffer cache hot.)
>>>
>>> Anyone can help to explain this? Thanks very much!
>>>
>>>
>>
>


Re: does HashingTF maintain a inverse index?

2016-01-01 Thread Yanbo Liang
Hi Andy,

Spark ML/MLlib does not currently provide a transformer to map HashingTF-generated
features back to words.
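
If you do need to go from feature indices back to words, one option is to build
the reverse mapping yourself from your own vocabulary. A sketch using the mllib
HashingTF's indexOf (in 1.5 the pipeline HashingTF delegates to it, so the
indices should line up); the vocabulary below is a placeholder, and hash
collisions mean one bucket can map to several words:

import org.apache.spark.mllib.feature.HashingTF

val hashingTF = new HashingTF(1 << 18)  // use the same numFeatures as when vectorizing

// Placeholder vocabulary; in practice, collect the distinct tokens from your corpus.
val vocabulary = Seq("spark", "hadoop", "classification", "document")

// index -> all words that hash to that index (collisions land in the same bucket)
val inverseIndex: Map[Int, Seq[String]] = vocabulary.groupBy(hashingTF.indexOf)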

2016-01-01 8:37 GMT+08:00 Hayri Volkan Agun :

> Hi,
>
> If you are using pipeline api, you do not need to map features back to
> documents.
> Your input (which is the document text) won't change after you used
> HashingTF.
> If you want to do Information Retrieval with spark, I suggest you to use
> not the pipeline but RDDs...
>
> On Fri, Jan 1, 2016 at 2:20 AM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> Hi
>>
>> I am working on proof of concept. I am trying to use spark to classify
>> some documents. I am using tokenizer and hashingTF to convert the documents
>> into vectors. Is there any easy way to map feature back to words or do I
>> need to maintain the reverse index my self? I realize there is a chance
>> some words map to same buck
>>
>> Kind regards
>>
>> Andy
>>
>>
>
>
> --
> Hayri Volkan Agun
> PhD. Student - Anadolu University
>


Re: NotSerializableException exception while using TypeTag in Scala 2.10

2016-01-01 Thread Yanbo Liang
I also hit this bug. Have you resolved this issue, or could you give some
suggestions?

2014-07-28 18:33 GMT+08:00 Aniket Bhatnagar :

> I am trying to serialize objects contained in RDDs using runtime
> relfection via TypeTag. However, the Spark job keeps
> failing java.io.NotSerializableException on an instance of TypeCreator
> (auto generated by compiler to enable TypeTags). Is there any workaround
> for this without switching to scala 2.11?
>


Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF:

import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(n)
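
For context, the default is 2^18 (262,144) features, and a power of two is
often recommended so terms map evenly across columns. Applying the configured
transformer is then just (a sketch; wordsDF is a placeholder DataFrame with a
"words" column of tokenized text):

// wordsDF: DataFrame with a "words" column, e.g. the output of a Tokenizer
val featurized = hashingTF.transform(wordsDF)
featurized.select("words", "features").show()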


2015-10-16 0:17 GMT+08:00 Nick Pentreath :

> Setting the numfeatures higher than vocab size will tend to reduce the
> chance of hash collisions, but it's not strictly necessary - it becomes a
> memory / accuracy trade off.
>
> Surprisingly, the impact on model performance of moderate hash collisions
> is often not significant.
>
> So it may be worth trying a few settings out (lower than vocab, higher
> etc) and see what the impact is on evaluation metrics.
>
> —
> Sent from Mailbox 
>
>
> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
> wrote:
>
>> Hi,
>>
>> There is a parameter in the HashingTF called "numFeatures". I was
>> wondering what is the best way to set the value to this parameter. In the
>> use case of text categorization, do you need to know in advance the number
>> of words in your vocabulary? or do you set it to be a large value, greater
>> than the number of words in your vocabulary?
>>
>> Thanks,
>>
>> Jianguo
>>
>
>


Re: ERROR server.TThreadPoolServer: Error occurred during processing of message

2016-01-01 Thread Dasun Hegoda
?

On Tue, Dec 29, 2015 at 12:08 AM, Dasun Hegoda 
wrote:

> Anyone?
>
> On Sun, Dec 27, 2015 at 11:30 AM, Dasun Hegoda 
> wrote:
>
>> I was able to figure out where the problem is exactly: it's Spark. When I
>> start HiveServer2 manually and run a query, it works fine, but when I try to
>> access Hive through Spark's thrift port it does not work and throws the
>> above-mentioned error.
>>
>> Please help me fix this.
>>
>> On Sun, Dec 27, 2015 at 11:15 AM, Dasun Hegoda 
>> wrote:
>>
>>> Yes, didn't work for me
>>>
>>>
>>> On Sun, Dec 27, 2015 at 10:56 AM, Ted Yu  wrote:
>>>
 Have you seen this ?


 http://stackoverflow.com/questions/30705576/python-cannot-connect-hiveserver2

 On Sat, Dec 26, 2015 at 9:09 PM, Dasun Hegoda 
 wrote:

> I'm running apache-hive-1.2.1-bin and spark-1.5.1-bin-hadoop2.6. spark
> as the hive engine. When I try to connect through JasperStudio using 
> thrift
> port I get below error. I'm running ubuntu 14.04.
>
> 15/12/26 23:36:20 ERROR server.TThreadPoolServer: Error occurred
> during processing of message.
> java.lang.RuntimeException:
> org.apache.thrift.transport.TSaslTransportException: No data or no sasl
> data in the stream
> at
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.thrift.transport.TSaslTransportException: No
> data or no sasl data in the stream
> at
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:328)
> at
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
> at
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
> ... 4 more
> 15/12/26 23:36:20 INFO thrift.ThriftCLIService: Client protocol
> version: HIVE_CLI_SERVICE_PROTOCOL_V5
> 15/12/26 23:36:20 INFO session.SessionState: Created local
> directory: /tmp/c670ff55-01bb-4f6f-a375-d22a13c44eaf_resources
> 15/12/26 23:36:20 INFO session.SessionState: Created HDFS
> directory: /tmp/hive/anonymous/c670ff55-01bb-4f6f-a375-d22a13c44eaf
> 15/12/26 23:36:20 INFO session.SessionState: Created local
> directory: /tmp/hduser/c670ff55-01bb-4f6f-a375-d22a13c44eaf
> 15/12/26 23:36:20 INFO session.SessionState: Created HDFS
> directory:
> /tmp/hive/anonymous/c670ff55-01bb-4f6f-a375-d22a13c44eaf/_tmp_space.db
> 15/12/26 23:36:20 INFO
> thriftserver.SparkExecuteStatementOperation: Running query 'use default'
> with d842cd88-2fda-42b2-b943-468017e95f37
> 15/12/26 23:36:20 INFO parse.ParseDriver: Parsing command: use
> default
> 15/12/26 23:36:20 INFO parse.ParseDriver: Parse Completed
> 15/12/26 23:36:20 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO log.PerfLogger:  method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO parse.ParseDriver: Parsing command: use
> default
> 15/12/26 23:36:20 INFO parse.ParseDriver: Parse Completed
> 15/12/26 23:36:20 INFO log.PerfLogger:  start=1451190980590 end=1451190980591 duration=1
> from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO log.PerfLogger:  method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
> 15/12/26 23:36:20 INFO metastore.HiveMetaStore: 2: get_database:
> default
> 15/12/26 23:36:20 INFO HiveMetaStore.audit: ugi=hduser
> ip=unknown-ip-addr cmd=get_database: default
> 15/12/26 23:36:20 INFO metastore.HiveMetaStore: 2: Opening raw
> store with implemenation 
> class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/12/26 23:36:20 INFO metastore.ObjectStore: ObjectStore,
> initialize called
> 15/12/26 23:36:20 INFO DataNucleus.Query: Reading in results for
> query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the
> connection used is closing
> 15/12/26 23:36:20 INFO metastore.MetaStoreDirectSql: Using direct
> SQL, underlying DB is DERBY
> 15/12/26 23:36:20 INFO metastore.ObjectStore: Initialized
> ObjectStore
> 15/12/26 

sqlContext Client cannot authenticate via:[TOKEN, KERBEROS]

2016-01-01 Thread philippe L
Hi everyone,

I'm actually facing a weird situation with the HiveContext and Kerberos in
yarn-client mode.

Current configuration:
HDP 2.2 (Hive 0.14, HDFS 2.6, YARN 2.6), Spark 1.5.2, HA NameNode activated,
Kerberos enabled.

Situation:
In the same Spark context, I receive "random" Kerberos errors. I assume they
are random because when I retry in the same context (session) I do manage to
get my result. I checked the dates between the servers and my NTP is correct;
there is no time gap.

[error on]
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 0.0 (TID 6, [datanode fqdn]):
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "[datanode
fqdn]/[datanode 9 address]"; destination host is: "[namenode fqdn]":8020;
[error off]

[code on]
val res = sqlContext.sql("SELECT * FROM foo.song")
val counts = res
  .map(row => row.getString(0).replace("""([\p{Punct}&&[^.@]]|\b\p{IsLetter}{1,2}\b)\s*""", ""))
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect()
[code off]
I tried this also on several tables and databases (around 40 different tables
with different structures and sizes) stored in my Hive server, and I always
receive the same error randomly (because when I try the same query a moment
later, it does work).

Also, I tried a crude test: because I'm on an HA NameNode, I activated nn2 in
place of nn1, and then my query did run, BUT without any result!
++  
|sentence|
++
++  
|   0|
++

I would not believe that my hive-site.xml is incorrect, because as I said, I
can access the data from time to time.

What does work:
Hive shell
beeline (connection and queries)
spark-shell sc.textFile operations

Logging inspection:
nothing in the YARN logs
no errors in the hiveserver2 logs


My question:
How does the HiveContext work with the Kerberos ticket? And how can I solve my
problem, or at least get more detail to debug it?


In advance, thank you for your answers.

Best,


Configuration detail: 

spark config : 
symbolic link of hive-site.xml to my spark conf dir : ln -s
/etc/hive/conf/hive-site.xml /home/myuser/spark/conf/
sh-4.1$ more java-opts 
-Dhdp.version=2.2.4.2-2
sh-4.1$ more spark-env.sh
export HADOOP_HOME=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
sh-4.1$ more spark-default.sh
spark.driver.extraJavaOptions -Dhdp.version=2.2.4.4-16 -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.4.4-16
spark.yarn.dist.files 
/usr/hdp/2.2.4.4-16/spark/conf/metrics.properties

Spark launch method in yarn-client mode:
kinit 
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
export YARN_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/myuser/spark
export
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/hdp/2.2.0.0-2041/hadoop/lib/native
export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/hdp/2.2.0.0-2041/hadoop/lib/native
export
SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
/home/myuser/spark/bin/spark-shell --master yarn-client --driver-memory 8g
--executor-memory 8g --num-executors 4 --executor-cores 5 --conf
"spark.storage.memoryFraction=0.2" --conf "spark.rdd.compress=true" --conf
"spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf
"spark.kryoserializer.buffer.max=512"




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-Client-cannot-authenticate-via-TOKEN-KERBEROS-tp25848.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to extend java transformer from Scala UnaryTransformer ?

2016-01-01 Thread Andy Davidson
I am trying to write a trivial transformer to use in my pipeline. I am using
Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as
an example. This should be very easy, however I do not understand Scala and I
am having trouble debugging the following exception.

Any help would be greatly appreciated.

Happy New Year

Andy

java.lang.IllegalArgumentException: requirement failed: Param null__inputCol
does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
at org.apache.spark.ml.param.Params$class.set(params.scala:436)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.param.Params$class.set(params.scala:422)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)



public class StemmerTest extends AbstractSparkTest {

@Test

public void test() {

Stemmer stemmer = new Stemmer()

.setInputCol("raw²) //line 30

.setOutputCol("filtered");

}

}


/**
 * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
 * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
 *
 * @author andrewdavidson
 */

public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {

static Logger logger = LoggerFactory.getLogger(Stemmer.class);

private static final long serialVersionUID = 1L;

private static final  ArrayType inputType =
DataTypes.createArrayType(DataTypes.StringType, true);

private final String uid = Stemmer.class.getSimpleName() + "_" +
UUID.randomUUID().toString();



@Override

public String uid() {

return uid;

}



/*

   override protected def validateInputType(inputType: DataType): Unit =
{

require(inputType == StringType, s"Input type must be string type but
got $inputType.")

  }

 */

@Override

public void validateInputType(DataType inputTypeArg) {

String msg = "inputType must be " + inputType.simpleString() + " but
got " + inputTypeArg.simpleString();

assert (inputType.equals(inputTypeArg)) : msg;

}



@Override

public Function1<List<String>, List<String>> createTransformFunc() {

// http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters

Function1<List<String>, List<String>> f =
    new AbstractFunction1<List<String>, List<String>>() {

public List<String> apply(List<String> words) {

for(String word : words) {

logger.error("AEDWIP input word: {}", word);

}

return words;

}

};



return f;

}



@Override

public DataType outputDataType() {

return DataTypes.createArrayType(DataTypes.StringType, true);

}

}
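
For what it's worth, one plausible reading of the exception (not verified
against this exact code): the inputCol Param is created while the
UnaryTransformer superclass constructor runs, which in Java happens before the
subclass's uid field initializer, so the Param is registered under a null
parent ("null__inputCol") and no longer matches the uid returned later. A
minimal sketch of a lazily initialized uid that avoids that ordering problem:

// Sketch only: the uid is generated on first use, so the value seen during
// superclass construction is the same one returned afterwards.
private String uid;

@Override
public String uid() {
    if (uid == null) {
        uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    }
    return uid;
}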




Deploying on TOMCAT

2016-01-01 Thread rahulganesh
I am having trouble deploying Spark on a Tomcat server. I have created a Spark
Java program and a servlet to access it in the web application. But whenever I
run it, I am not able to get the output; it fails with java.lang.OutOfMemoryError
or other errors. Is it possible to deploy Spark on a Tomcat server? If yes, how
do I create a web application that runs Spark programs?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-on-TOMCAT-tp25846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org