Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Jorge Sánchez
Hi Gerard,

have you tried running in yarn-client mode? If so, do you still get that
same error?
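
If yarn-client works but cluster mode does not, it is usually because in
cluster mode the driver runs inside YARN, where the ticket cache created by
kinit on your submit host is not visible. Passing both --principal and
--keytab to spark-submit normally takes care of this (the principal below is
only an illustration; use your own):

spark-submit --class "graphx_sp" --master yarn --deploy-mode cluster \
  --principal youruser@YOUR.REALM --keytab /path/to/keytab \
  --executor-memory 13G target/scala-2.10/graphx_sp_2.10-1.0.jar

As an aside, I believe --total-executor-cores only applies to standalone and
Mesos masters; on YARN the equivalent options are --num-executors and
--executor-cores.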

Regards.

2016-12-05 12:49 GMT+00:00 Gerard Casey :

> Edit. From here
> 
> I read that you can pass a `--keytab` option to spark-submit, so I tried:
>
> spark-submit --class "graphx_sp" --master yarn --keytab
> /path/to/keytab --deploy-mode cluster --executor-memory 13G
> --total-executor-cores 32 target/scala-2.10/graphx_sp_2.10-1.0.jar
>
> However, the error persists
>
> Any ideas?
>
> Thanks
>
> Geroid
>
> On 5 Dec 2016, at 13:35, Gerard Casey  wrote:
>
> Hello all,
>
> I am using Spark with Kerberos authentication.
>
> I can run my code using `spark-shell` fine, and I can also use
> `spark-submit` in local mode (e.g. --master local[16]). Both function as
> expected.
>
> local mode -
>
> spark-submit --class "graphx_sp" --master local[16] --driver-memory 20G
> target/scala-2.10/graphx_sp_2.10-1.0.jar
>
> I am now progressing to run in cluster mode using YARN.
>
> cluster mode with YARN -
>
> spark-submit --class "graphx_sp" --master yarn --deploy-mode cluster
> --executor-memory 13G --total-executor-cores 32
> target/scala-2.10/graphx_sp_2.10-1.0.jar
>
> However, this returns:
>
> diagnostics: User class threw exception:
> org.apache.hadoop.security.AccessControlException: Authentication required
>
> Before I run spark-shell, or spark-submit in local mode, I do the
> following Kerberos setup:
>
> kinit -k -t ~/keytab -r 7d `whoami`
>
> Clearly, this setup does not carry over to YARN. How do I fix the
> Kerberos issue with YARN in cluster mode? Is this something that must be
> handled in my /src/main/scala/graphx_sp.scala file?
>
> Many thanks
>
> Geroid
>
>
>


Re: how to merge dataframe write output files

2016-11-10 Thread Jorge Sánchez
Do you have the logs of the containers? This seems like a memory issue.
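
If the goal is simply fewer output files rather than exactly one, coalescing
to a small number of partitions is usually much easier on memory than
coalesce(1), because a single task no longer has to write everything. A
minimal sketch in Scala (the target path below is illustrative):

    df.coalesce(16).write.parquet("/parquetdata/weixin/biztags/biztag2_merged")
    // or, if an extra shuffle is acceptable:
    // df.repartition(16).write.parquet("/parquetdata/weixin/biztags/biztag2_merged")

Exit code 143 usually means YARN killed the container, which fits a
memory-related cause.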

2016-11-10 7:28 GMT+00:00 lk_spark :

> hi, all:
> when I call df.write.parquet, there are a lot of small files. How can
> I merge them into one file? I tried df.coalesce(1).write.parquet, but it
> sometimes fails with an error:
>
> Container exited with a non-zero exit code 143
>
> more and more...
> -rw-r--r--   2 hadoop supergroup 14.5 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00165-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 16.4 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00166-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00167-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 14.2 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00168-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00169-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 14.4 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00170-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00171-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00172-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 16.0 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00173-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00174-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 14.0 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00175-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc.snappy.parquet
> -rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11
> /parquetdata/weixin/biztags/biztag2/part-r-00176-0f61afe4-
> 23e8-40bb-b30b-09652ca677bc
> more and more...
> 2016-11-10
> --
> lk_spark
>


Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan,

there was a talk at Spark Summit:
https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/
Apparently they had a lot of problems, and the project seems abandoned.

If you just have to do a simple ingestion of a full table or a simple query,
just use Sqoop as suggested by Mich. But if your use case requires further
transformation of the data, I'd suggest you try connecting Spark to Oracle
over JDBC and working with the data as a DataFrame.
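
As a rough sketch of that JDBC route (the connection URL, table and
credentials below are placeholders, and the Oracle JDBC driver jar has to be
on the classpath):

    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
      .option("dbtable", "SCHEMA.SOME_TABLE")
      .option("user", "username")
      .option("password", "password")
      .load()

    // from here it is an ordinary DataFrame; transform it and, for example,
    // append it to a Hive table (assumes a HiveContext-backed sqlContext)
    df.write.mode("append").saveAsTable("target_db.some_table")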

Regards.

2016-04-06 6:59 GMT+01:00 ayan guha :

> Thanks guys for feedback.
>
> On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke  wrote:
>
>> I do not think you can be more resource efficient. In the end you have to
>> store the data on HDFS anyway, and building something like Sqoop yourself
>> is a lot of development effort, especially around error handling.
>> You could create a ticket with the Sqoop project to support Spark as an
>> execution engine; it may be less effort to plug it in there.
>> If your cluster is heavily loaded, you may instead want to add more
>> machines or improve the existing programs.
>>
>> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>>
>> One of the reasons in my mind is to avoid MapReduce applications
>> completely during ingestion, if possible. Also, I can then use a Spark
>> standalone cluster to ingest, even if my Hadoop cluster is heavily loaded.
>> What do you guys think?
>>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>>
>>> Why do you want to reimplement something which is already there?
>>>
>>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>>
>>> Hi
>>>
>>> Thanks for the reply. My use case is to query ~40 tables from Oracle
>>> (index-based and incremental only) and add the data to existing Hive
>>> tables. Also, it would be good to have an option to create Hive tables,
>>> driven by job-specific configuration.
>>>
>>> What do you think?
>>>
>>> Best
>>> Ayan
>>>
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>>> wrote:
>>>
 Hi,

 It depends on your use case for Sqoop.
 What's it like?

 // maropu

 On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:

> Hi All
>
> Asking for opinions: is it possible/advisable to use Spark to replace what
> Sqoop does? Are there any existing projects along similar lines?
>
> --
> Best Regards,
> Ayan Guha
>



 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: an error when I read data from parquet

2016-02-22 Thread Jorge Sánchez
Hi Alex,

it seems there is a problem with Spark Notebook; I suggest you follow the
issue there (or you could try Apache Zeppelin or spark-shell directly if
notebooks are not a requirement):

https://github.com/andypetrella/spark-notebook/issues/380
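
A quick way to rule the notebook in or out is to run the same read from a
plain spark-shell on the same cluster, for example:

    val data_full = sqlContext.read.parquet("---")
    data_full.printSchema()

If that works, the problem is likely in the notebook's environment rather
than in Spark itself.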

Regards.

2016-02-19 12:59 GMT+00:00 AlexModestov :

> Hello everybody,
>
> I use both the Python API and the Scala API. I can read the data without
> problems with the Python API:
>
> "sqlContext = SQLContext(sc)
> data_full = sqlContext.read.parquet("---")"
>
> But when I use Scala:
>
> "val sqlContext = new SQLContext(sc)
> val data_full = sqlContext.read.parquet("---")"
>
> I get this error (I use Spark Notebook; maybe that is important):
> "java.lang.ExceptionInInitializerError
> at sun.misc.Unsafe.ensureClassInitialized(Native Method)
> at
>
> sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(UnsafeFieldAccessorFactory.java:43)
> at
> sun.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:140)
> at java.lang.reflect.Field.acquireFieldAccessor(Field.java:1057)
> at java.lang.reflect.Field.getFieldAccessor(Field.java:1038)
> at java.lang.reflect.Field.get(Field.java:379)
> at notebook.kernel.Repl.getModule$1(Repl.scala:203)
> at notebook.kernel.Repl.iws$1(Repl.scala:212)
> at notebook.kernel.Repl.liftedTree1$1(Repl.scala:219)
> at notebook.kernel.Repl.evaluate(Repl.scala:199)
> at
>
> notebook.client.ReplCalculator$$anonfun$15$$anon$1$$anonfun$29.apply(ReplCalculator.scala:378)
> at
>
> notebook.client.ReplCalculator$$anonfun$15$$anon$1$$anonfun$29.apply(ReplCalculator.scala:375)
> at
>
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.NoSuchMethodException:
>
> org.apache.spark.io.SnappyCompressionCodec.<init>(org.apache.spark.SparkConf)
> at java.lang.Class.getConstructor0(Class.java:2892)
> at java.lang.Class.getConstructor(Class.java:1723)
> at
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:71)
> at
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
> at
> org.apache.spark.broadcast.TorrentBroadcast.org
> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
>
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
> at
>
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:108)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at
>
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
>
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
> at
>
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
> at
>
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
> at
>
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.toJSON(DataFrame.scala:1724)
> at
>
> notebook.front.widgets.DataFrameView$class.notebook$front$widgets$DataFrameView$$json(DataFrame.scala:40)
> at
>
> notebook.front.widgets.DataFrameWidget.notebook$front$widgets$DataFrameView$$json$lzycompute(DataFrame.scala:64)
> at
>
> notebook.front.widgets.DataFrameWidget.notebook$front$widgets$DataFrameView$$json(DataFrame.scala:64)
> at
> notebook.front.widgets.DataFrameView$class.$init$(DataFrame.s

Re: How VectorIndexer works in Spark ML pipelines

2015-10-18 Thread Jorge Sánchez
Vishnu,

VectorIndexer will add metadata recording which features are categorical
and which are continuous: if a feature has more distinct values than the
maxCategories parameter, it is treated as continuous. That metadata helps
the learning algorithms, since categorical and continuous features are
handled differently.
From the data I can see you have more than one vector in the features
column? Try using some vectors with only two different values.
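
As a small illustration (the toy data below is made up): with
maxCategories = 4, a feature that only ever takes the values 0.0 and 1.0 is
indexed as categorical, while a feature with more than four distinct values
stays continuous.

    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.mllib.linalg.Vectors

    val toy = sqlContext.createDataFrame(Seq(
      (0.0, Vectors.dense(1.0, 12.5)),
      (1.0, Vectors.dense(0.0, 33.1)),
      (0.0, Vectors.dense(1.0, 47.8)),
      (1.0, Vectors.dense(0.0, 8.2)),
      (0.0, Vectors.dense(1.0, 61.0))
    )).toDF("label", "features")

    val model = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(toy)

    // categoryMaps lists the feature indices that were treated as categorical
    println(model.categoryMaps)
    model.transform(toy).show()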

Regards.

2015-10-15 10:14 GMT+01:00 VISHNU SUBRAMANIAN :

> HI All,
>
> I am trying to use the VectorIndexer (feature extraction) technique
> available in Spark ML Pipelines.
>
> I ran the example in the documentation.
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(4)
>   .fit(data)
>
>
> And then I wanted to see what output it generates.
>
> After performing a transform on the data set, the output looks like this:
>
> scala> predictions.select("indexedFeatures").take(1).foreach(println)
>
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
>
> scala> predictions.select("features").take(1).foreach(println)
>
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> I can't understand what is happening. I tried with simple data sets too,
> but got a similar result.
>
> Please help.
>
> Thanks,
>
> Vishnu
>
>


Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex,

If the spark.sql.shuffle.partitions setting does not help, you can also try
the functions coalesce(n) or repartition(n) on the DataFrame.

As per the API, coalesce will not cause a shuffle, but repartition will.
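
A short sketch, assuming a DataFrame called df already exists (the numbers
are arbitrary):

    // number of partitions produced by DataFrame joins and aggregations
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    val reshuffled = df.repartition(32)     // full shuffle into 32 partitions
    val merged     = reshuffled.coalesce(8) // narrow dependency, no shuffle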

Regards.

2015-10-16 0:52 GMT+01:00 Mohammed Guller :

> You may find the spark.sql.shuffle.partitions property useful. The default
> value is 200.
>
>
>
> Mohammed
>
>
>
> From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
> Sent: Wednesday, October 14, 2015 8:14 PM
> To: user
> Subject: dataframes and numPartitions
>
>
>
> A lot of RDD methods take a numPartitions parameter that lets you specify
> the number of partitions in the result. For example, groupByKey.
>
>
>
> The DataFrame counterparts don't have a numPartitions parameter, e.g.
> groupBy only takes a bunch of Columns as params.
>
>
>
> I understand that the DataFrame API is supposed to be smarter and go
> through a LogicalPlan, and perhaps determine the number of optimal
> partitions for you, but sometimes you want to specify the number of
> partitions yourself. One such use case is when you are preparing to do a
> "merge" join with another dataset that is similarly partitioned with the
> same number of partitions.
>


Re: Implement "LIKE" in SparkSQL

2015-09-14 Thread Jorge Sánchez
I think after you get your table as a DataFrame, you can do a filter over
it, something like:

val t = sqlContext.sql("select * from table t")
val df = t.filter(t("a").contains(t("b")))
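
If contains is not precise enough (for example, you really need a prefix
match inside the SQL itself), a small UDF is another option on versions that
lack concat/locate. A sketch, reusing the table and column names from your
mail (the UDF name is made up):

    sqlContext.udf.register("is_prefix",
      (orderId: String, tid: String) => orderId.startsWith(tid))

    val combined = sqlContext.sql(
      "select * from tradeTab a, orderTab b where is_prefix(b.orderId, a.tid)")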

Let us know the results.

2015-09-12 10:45 GMT+01:00 liam :

>
> OK, I found another way. It looks silly and inefficient, but it works.
>
> tradeDF.registerTempTable(tradeTab);
>
> orderDF.registerTempTable(orderTab);
>
> //orderId = tid + "_x"
>
> String sql1 = "select * from " + tradeTab + " a, " + orderTab + " b where
> substr(b.orderId,1,15) = substr(a.tid,1) ";
>
> String sql2 = "select * from " + tradeTab + " a, " + orderTab + " b where
> substr(b.orderId,1,16) = substr(a.tid,1) ";
>
> String sql3 = "select * from " + tradeTab + " a, " + orderTab + " b where
> substr(b.orderId,1,17) = substr(a.tid,1) ";
>
> DataFrame combinDF =
> sqlContext.sql(sql1).unionAll(sqlContext.sql(sql2)).unionAll(sqlContext.sql(sql3));
>
>
>  As I tried:
>    substr(b.orderId,1,length(a.tid)) = a.tid   -> no length available
>    b.orderId like concat(a.tid,'%')            -> no concat available
>    instr(b.orderId,a.tid) > 0                  -> no instr available
>    locate(a.tid,b.orderId) > 0                 -> no locate available
>    ...                                         -> no ...
>
>
>
> 2015-09-12 13:49 GMT+08:00 Richard Eggert :
>
>> concat and locate are available as of version 1.5.0, according to the
>> Scaladocs. For earlier versions of Spark, and for the operations that are
>> still not supported, it's pretty straightforward to define your own
>> UserDefinedFunctions in either Scala or Java (I don't know about other
>> languages).
>> On Sep 11, 2015 10:26 PM, "liam"  wrote:
>>
>>> Hi,
>>>
>>>  Imagine this: the value of one column is a substring of another
>>> column. When using Oracle, I have many ways to write the query, like the
>>> following statement, but how do I do it in Spark SQL, since there is no
>>> concat(), instr(), locate()...?
>>>
>>>
>>> select * from table t where t.a like '%'||t.b||'%';
>>>
>>>
>>> Thanks.
>>>
>>>
>