Re: How to get the data url

2017-11-03 Thread 小野圭二
Thank you for your reply, jgp.
The URL in my question was just a sample; any URL would do.

What I mean is: imagine a multi-user Spark environment. This is just a case model.
 - I am the watcher (administrator) of a Spark system.
 - Some users run their applications on my Spark cluster, and I need to know which
URLs are being accessed on it.
 - You might say I can find the URLs in a log file, but that is troublesome.
 - If there were an API that returns the accessed URL as a string, I could write a
program to gather all the URLs currently in use.

I believe there is a suitable API for this, but I just do not know it yet.
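As a sketch of such a gathering program (the monitoring REST API behind
org.apache.spark.status.api.v1 can list the running applications, though not the data
URLs they access; the host and port below are assumptions for a driver UI on
localhost:4040):

    import json
    from urllib.request import urlopen

    def list_applications(ui="http://localhost:4040"):
        # /api/v1/applications is Spark's monitoring REST endpoint
        with urlopen(ui + "/api/v1/applications") as resp:
            return json.loads(resp.read().decode("utf-8"))

    for app in list_applications():
        print(app["id"], app["name"])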

Thx.

-Keiji

2017-11-03 19:48 GMT+09:00 Jean Georges Perrin :

> I am a little confused by your question… Are you trying to ingest a file
> from S3?
>
> If so… look for net.jgp.labs.spark on GitHub and look for
> net.jgp.labs.spark.l000_ingestion.l001_csv_in_progress.S3CsvToDataset
>
> You can modify the file as the keys are yours…
>
> If you want to download first: look at net.jgp.labs.spark.l900_analytics.
> ListNCSchoolDistricts
>
> jgp
>
> On Oct 29, 2017, at 22:20, onoke  wrote:
>
> Hi,
>
> I am searching for a useful API for getting the data URL that is accessed by
> an application on Spark.
> For example, when this URL appears in an application:
>
>   new
> URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv")
>
> How can I get this URL using the Spark API?
> I looked in org.apache.spark.api.java and org.apache.spark.status.api.v1,
> but they do not provide any URL info.
>
> Any advice is welcome.
>
> -Keiji


unable to run spark streaming example

2017-11-03 Thread Imran Rajjad
I am trying out the network word count example, and my unit test is
producing the console output below, with an exception:

Exception in thread "dispatcher-event-loop-5"
java.lang.NoClassDefFoundError:
scala/runtime/AbstractPartialFunction$mcVL$sp
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint.receive(ReceiverTracker.scala:476)
 at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
 at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
scala.runtime.AbstractPartialFunction$mcVL$sp
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 20 more
---
Time: 1509716745000 ms
---
---
Time: 1509716746000 ms
---

From DOS I am pushing a text file through netcat with the following command:

nc -l -p  < license.txt

...

Below are my Spark-related Maven dependencies:

<spark.version>2.1.1</spark.version>

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-launcher_2.10</artifactId>
   <version>${spark.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-graphx_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>graphframes</groupId>
   <artifactId>graphframes</artifactId>
   <version>0.5.0-spark2.1-s_2.11</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-mllib_2.10</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.10</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>

-- 
I.R


Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread ayan guha
You can use 10 passes over the same dataset and build the data.
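A rough sketch of that multi-pass idea in PySpark (the column names id, desc,
parent_id and the table name are assumptions taken from the thread; this is not
tested code): walk up the parent chain by repeatedly self-joining on the parent id,
then pivot the collected ancestors into the wide level columns.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    cat = spark.table("CategoryTempTable")        # id, desc, parent_id

    parents = cat.select(F.col("id").alias("p_id"),
                         F.col("desc").alias("p_desc"),
                         F.col("parent_id").alias("p_parent"))

    # Pass 1: every row is its own ancestor.
    frontier = cat.select(F.col("id").alias("row_id"),
                          F.col("id").alias("anc_id"),
                          F.col("desc").alias("anc_desc"),
                          F.col("parent_id").alias("next_parent"))
    ancestors = frontier

    # Passes 2..10: replace each frontier entry with its parent and collect it.
    for _ in range(9):
        frontier = (frontier.join(parents, F.col("next_parent") == F.col("p_id"))
                            .select(F.col("row_id"),
                                    F.col("p_id").alias("anc_id"),
                                    F.col("p_desc").alias("anc_desc"),
                                    F.col("p_parent").alias("next_parent")))
        ancestors = ancestors.union(frontier)

    # Two digits per level, so an ancestor's level is length(id)/2; pivot the
    # collected ancestors into one wide row per record (columns 1_id, 1_name, ...).
    wide = (ancestors
            .withColumn("level", (F.length("anc_id") / 2).cast("int"))
            .groupBy("row_id")
            .pivot("level", list(range(1, 11)))
            .agg(F.first("anc_id").alias("id"), F.first("anc_desc").alias("name")))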


On Fri, Nov 3, 2017 at 9:48 PM, Jean Georges Perrin 
wrote:

> Write a UDF?
>
> On Oct 31, 2017, at 11:48, Aakash Basu  wrote:
>
> Hey all,
>
> Any help with the below, please?
>
> Thanks,
> Aakash.
>
>
> -- Forwarded message --
> From: Aakash Basu 
> Date: Tue, Oct 31, 2017 at 9:17 PM
> Subject: Regarding column partitioning IDs and names as per hierarchical
> level SparkSQL
> To: user 
>
>
> Hi all,
>
> I have to generate a table with Spark-SQL with the following columns -
>
>
> Level One Id: VARCHAR(20) NULL
> Level One Name: VARCHAR(50) NOT NULL
> Level Two Id: VARCHAR(20) NULL
> Level Two Name: VARCHAR(50) NULL
> Level Three Id: VARCHAR(20) NULL
> Level Three Name: VARCHAR(50) NULL
> Level Four Id: VARCHAR(20) NULL
> Level Four Name: VARCHAR(50) NULL
> Level Five Id: VARCHAR(20) NULL
> Level Five Name: VARCHAR(50) NULL
> Level Six Id: VARCHAR(20) NULL
> Level Six Name: VARCHAR(50) NULL
> Level Seven Id: VARCHAR(20) NULL
> Level Seven Name: VARCHAR(50) NULL
> Level Eight Id: VARCHAR(20) NULL
> Level Eight Name: VARCHAR(50) NULL
> Level Nine Id: VARCHAR(20) NULL
> Level Nine Name: VARCHAR(50) NULL
> Level Ten Id: VARCHAR(20) NULL
> Level Ten Name: VARCHAR(50) NULL
>
> My input source has these columns -
>
>
> ID        Description        ParentID
> 10        Great-Grandfather
> 1010      Grandfather        10
> 101010    1. Father A        1010
> 101011    2. Father B        1010
> 101012    4. Father C        1010
> 101013    5. Father D        1010
> 101015    3. Father E        1010
> 101018    Father F           1010
> 101019    6. Father G        1010
> 101020    Father H           1010
> 101021    Father I           1010
> 101022    2A. Father J       1010
> 10101010  2. Father K        101010
>
> Like the above, I have IDs of up to 20 digits, which means I have 10 levels.
>
> I want to populate each record's ID and name along with those of all its parents
> up to the root, for any particular level, and I am unable to come up with concrete
> logic for this.
>
> I am using the query below to place each record at its respective level in the
> respective columns, but it does not populate their parents -
>
> Present Logic ->
>
> FinalJoin_DF = spark.sql("select "
>   + "case when length(a.id)/2 = '1' then a.id else ' ' end as level_one_id, "
>   + "case when length(a.id)/2 = '1' then a.desc else ' ' end as level_one_name, "
>   + "case when length(a.id)/2 = '2' then a.id else ' ' end as level_two_id, "
>   + "case when length(a.id)/2 = '2' then a.desc else ' ' end as level_two_name, "
>   + "case when length(a.id)/2 = '3' then a.id else ' ' end as level_three_id, "
>   + "case when length(a.id)/2 = '3' then a.desc else ' ' end as level_three_name, "
>   + "case when length(a.id)/2 = '4' then a.id else ' ' end as level_four_id, "
>   + "case when length(a.id)/2 = '4' then a.desc else ' ' end as level_four_name, "
>   + "case when length(a.id)/2 = '5' then a.id else ' ' end as level_five_id, "
>   + "case when length(a.id)/2 = '5' then a.desc else ' ' end as level_five_name, "
>   + "case when length(a.id)/2 = '6' then a.id else ' ' end as level_six_id, "
>   + "case when length(a.id)/2 = '6' then a.desc else ' ' end as level_six_name, "
>   + "case when length(a.id)/2 = '7' then a.id else ' ' end as level_seven_id, "
>   + "case when length(a.id)/2 = '7' then a.desc else ' ' end as level_seven_name, "
>   + "case when length(a.id)/2 = '8' then a.id else ' ' end as level_eight_id, "
>   + "case when length(a.id)/2 = '8' then a.desc else ' ' end as level_eight_name, "
>   + "case when length(a.id)/2 = '9' then a.id else ' ' end as level_nine_id, "
>   + "case when length(a.id)/2 = '9' then a.desc else ' ' end as level_nine_name, "
>   + "case when length(a.id)/2 = '10' then a.id else ' ' end as level_ten_id, "
>   + "case when length(a.id)/2 = '10' then a.desc else ' ' end as level_ten_name "
>   + "from CategoryTempTable a")
>
>
> Can someone please help me also populate all the parent levels in the respective
> level ID and level name columns?
>
>
> Thanks,
> Aakash.
>
>
>


-- 
Best Regards,
Ayan Guha


Re: pyspark configuration with Jupyter

2017-11-03 Thread Jeff Zhang
You are setting PYSPARK_DRIVER_PYTHON to jupyter; please set it to a plain python executable instead.
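For reference, PYSPARK_DRIVER_PYTHON (and PYSPARK_DRIVER_PYTHON_OPTS) are the
standard variables involved; a quick sketch for checking what ./bin/pyspark will
pick up:

    # Minimal check of the two variables the launcher reads for the driver interpreter.
    import os
    print(os.environ.get("PYSPARK_DRIVER_PYTHON"))       # should point at a plain python here
    print(os.environ.get("PYSPARK_DRIVER_PYTHON_OPTS"))  # set only when the notebook is wanted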


On Fri, Nov 3, 2017 at 7:31 PM, anudeep wrote:

> Hello experts,
>
> I installed Jupyter Notebook through Anaconda and set the PySpark driver to use
> Jupyter Notebook.
>
> I see the issue below when I try to open pyspark.
>
> [anudeepg@datanode2 spark-2.1.0]$ ./bin/pyspark
> [I 07:29:53.184 NotebookApp] The port  is already in use, trying
> another port.
> [I 07:29:53.211 NotebookApp] JupyterLab alpha preview extension loaded
> from /home/anudeepg/anaconda2/lib/python2.7/site-packages/jupyterlab
> JupyterLab v0.27.0
> Known labextensions:
> [I 07:29:53.212 NotebookApp] Running the core application with no
> additional extensions or settings
> [I 07:29:53.214 NotebookApp] Serving notebooks from local directory:
> /opt/mapr/spark/spark-2.1.0
> [I 07:29:53.214 NotebookApp] 0 active kernels
> [I 07:29:53.214 NotebookApp] The Jupyter Notebook is running at:
> http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e
> [I 07:29:53.214 NotebookApp] Use Control-C to stop this server and shut
> down all kernels (twice to skip confirmation).
> [W 07:29:53.214 NotebookApp] No web browser found: could not locate
> runnable browser.
> [C 07:29:53.214 NotebookApp]
>
> Copy/paste this URL into your browser when you connect for the first
> time,
> to login with a token:
>
> http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e
>
>
> Can someone please help me here?
>
> Thanks!
> Anudeep
>
>


pyspark configuration with Jupyter

2017-11-03 Thread anudeep
Hello experts,

I installed Jupyter Notebook through Anaconda and set the PySpark driver to use
Jupyter Notebook.

I see the issue below when I try to open pyspark.

[anudeepg@datanode2 spark-2.1.0]$ ./bin/pyspark
[I 07:29:53.184 NotebookApp] The port  is already in use, trying
another port.
[I 07:29:53.211 NotebookApp] JupyterLab alpha preview extension loaded from
/home/anudeepg/anaconda2/lib/python2.7/site-packages/jupyterlab
JupyterLab v0.27.0
Known labextensions:
[I 07:29:53.212 NotebookApp] Running the core application with no
additional extensions or settings
[I 07:29:53.214 NotebookApp] Serving notebooks from local directory:
/opt/mapr/spark/spark-2.1.0
[I 07:29:53.214 NotebookApp] 0 active kernels
[I 07:29:53.214 NotebookApp] The Jupyter Notebook is running at:
http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e
[I 07:29:53.214 NotebookApp] Use Control-C to stop this server and shut
down all kernels (twice to skip confirmation).
[W 07:29:53.214 NotebookApp] No web browser found: could not locate
runnable browser.
[C 07:29:53.214 NotebookApp]

Copy/paste this URL into your browser when you connect for the first
time,
to login with a token:

http://localhost:8889/?token=9aa5dc87cb5a6d987237f68e2f0b7e9c70a7f2e8c9a7cf2e


Can someone please help me here?

Thanks!
Anudeep


Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
Write a UDF?
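A rough PySpark sketch of that UDF idea (it leans on the pattern visible in the
sample data below -- each level adds two digits, so an ancestor's id is a prefix of
the child's id; the table and column names are assumptions):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    def ancestor_id(full_id, level):
        # level-n ancestor id = first 2*n digits, or None when the row is not that deep
        if full_id is None:
            return None
        prefix = full_id[:2 * level]
        return prefix if len(prefix) == 2 * level else None

    ancestor_id_udf = F.udf(ancestor_id, StringType())

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("CategoryTempTable")         # id, desc (assumed column names)
    for n in range(1, 11):
        df = df.withColumn("level_%d_id" % n, ancestor_id_udf(F.col("id"), F.lit(n)))
    # The level_*_name columns can then be filled by joining each level_*_id back
    # to the same table's id -> desc mapping.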

> On Oct 31, 2017, at 11:48, Aakash Basu wrote:
> 
> Hey all,
> 
> Any help with the below, please?
> 
> Thanks,
> Aakash.
> 
> 
> -- Forwarded message --
> From: Aakash Basu
> Date: Tue, Oct 31, 2017 at 9:17 PM
> Subject: Regarding column partitioning IDs and names as per hierarchical 
> level SparkSQL
> To: user
> 
> 
> Hi all,
> 
> I have to generate a table with Spark-SQL with the following columns -
> 
> 
> Level One Id: VARCHAR(20) NULL
> Level One Name: VARCHAR(50) NOT NULL
> Level Two Id: VARCHAR(20) NULL
> Level Two Name: VARCHAR(50) NULL
> Level Three Id: VARCHAR(20) NULL
> Level Three Name: VARCHAR(50) NULL
> Level Four Id: VARCHAR(20) NULL
> Level Four Name: VARCHAR(50) NULL
> Level Five Id: VARCHAR(20) NULL
> Level Five Name: VARCHAR(50) NULL
> Level Six Id: VARCHAR(20) NULL
> Level Six Name: VARCHAR(50) NULL
> Level Seven Id: VARCHAR(20) NULL
> Level Seven Name: VARCHAR(50) NULL
> Level Eight Id: VARCHAR(20) NULL
> Level Eight Name: VARCHAR(50) NULL
> Level Nine Id: VARCHAR(20) NULL
> Level Nine Name: VARCHAR(50) NULL
> Level Ten Id: VARCHAR(20) NULL
> Level Ten Name: VARCHAR(50) NULL
> 
> My input source has these columns -
> 
> 
> ID        Description        ParentID
> 10        Great-Grandfather
> 1010      Grandfather        10
> 101010    1. Father A        1010
> 101011    2. Father B        1010
> 101012    4. Father C        1010
> 101013    5. Father D        1010
> 101015    3. Father E        1010
> 101018    Father F           1010
> 101019    6. Father G        1010
> 101020    Father H           1010
> 101021    Father I           1010
> 101022    2A. Father J       1010
> 10101010  2. Father K        101010
> 
> Like the above, I have IDs of up to 20 digits, which means I have 10 levels.
> 
> I want to populate each record's ID and name along with those of all its parents
> up to the root, for any particular level, and I am unable to come up with concrete
> logic for this.
> 
> I am using the query below to place each record at its respective level in the
> respective columns, but it does not populate their parents -
> 
> Present Logic ->
> 
> FinalJoin_DF = spark.sql("select "
>   + "case when length(a.id)/2 = '1' then a.id else ' ' end as level_one_id, "
>   + "case when length(a.id)/2 = '1' then a.desc else ' ' end as level_one_name, "
>   + "case when length(a.id)/2 = '2' then a.id else ' ' end as level_two_id, "
>   + "case when length(a.id)/2 = '2' then a.desc else ' ' end as level_two_name, "
>   + "case when length(a.id)/2 = '3' then a.id else ' ' end as level_three_id, "
>   + "case when length(a.id)/2 = '3' then a.desc else ' ' end as level_three_name, "
>   + "case when length(a.id)/2 = '4' then a.id else ' ' end as level_four_id, "
>   + "case when length(a.id)/2 = '4' then a.desc else ' ' end as level_four_name, "
>   + "case when length(a.id)/2 = '5' then a.id else ' ' end as level_five_id, "
>   + "case when length(a.id)/2 = '5' then a.desc else ' ' end as level_five_name, "
>   + "case when length(a.id)/2 = '6' then a.id else ' ' end as level_six_id, "
>   + "case when length(a.id)/2 = '6' then a.desc else ' ' end as level_six_name, "
>   + "case when length(a.id)/2 = '7' then a.id else ' ' end as level_seven_id, "
>   + "case when length(a.id)/2 = '7' then a.desc else ' ' end as level_seven_name, "
>   + "case when length(a.id)/2 = '8' then a.id else ' ' end as level_eight_id, "
>   + "case when length(a.id)/2 = '8' then a.desc else ' ' end as level_eight_name, "
>   + "case when length(a.id)/2 = '9' then a.id else ' ' end as level_nine_id, "
>   + "case when length(a.id)/2 = '9' then a.desc else ' ' end as level_nine_name, "
>   + "case when length(a.id)/2 = '10' then a.id else ' ' end as level_ten_id, "
>   + "case when length(a.id)/2 = '10' then a.desc else ' ' end as level_ten_name "
>   + "from CategoryTempTable a")

Re: How to get the data url

2017-11-03 Thread Jean Georges Perrin
I am a little confused by your question… Are you trying to ingest a file from 
S3?

If so… look for net.jgp.labs.spark on GitHub and look for 
net.jgp.labs.spark.l000_ingestion.l001_csv_in_progress.S3CsvToDataset 

You can modify the file as the keys are yours…

If you want to download first: look at 
net.jgp.labs.spark.l900_analytics.ListNCSchoolDistricts
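For comparison, a minimal PySpark sketch of the same ingestion (the repository
example above is Java; the s3a:// path simply reuses the sample URL from the
question, the CSV options are guesses about that file, and a working hadoop-aws /
S3 credentials setup is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-csv-sketch").getOrCreate()

    df = (spark.read
          .option("header", "true")        # adjust header/delimiter to the actual file
          .option("inferSchema", "true")
          .csv("s3a://apache-zeppelin/tutorial/bank/bank.csv"))

    df.printSchema()
    df.show(5)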

jgp

> On Oct 29, 2017, at 22:20, onoke wrote:
> 
> Hi,
> 
> I am searching for a useful API for getting the data URL that is accessed by
> an application on Spark.
> For example, when this URL appears in an application:
> 
>   new
> URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv")
> 
> How can I get this URL using the Spark API?
> I looked in org.apache.spark.api.java and org.apache.spark.status.api.v1,
> but they do not provide any URL info.
> 
> Any advice is welcome.
> 
> -Keiji 



Re: Hi all,

2017-11-03 Thread Jean Georges Perrin
Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and 
use it in your process, keeping everything in Spark and avoiding 
save/ingestion...
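A minimal sketch of that suggestion (the DataFrame, column names, and paths are
assumptions, not Oren's actual code):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("/path/to/pre_process_input")   # assumed input

    # Group once and keep the result inside Spark instead of writing it out and re-ingesting.
    per_user = events.groupBy("user_id").agg(F.collect_list("payload").alias("events"))

    per_user.cache()        # or: spark.sparkContext.setCheckpointDir("/tmp/ckpt"); per_user = per_user.checkpoint()
    per_user.count()        # materialize it once

    # The downstream "process" step reuses per_user without another shuffle or re-read.
    result = per_user.withColumn("n_events", F.size("events"))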


> On Oct 31, 2017, at 08:17, אורן שמון <oren.sha...@gmail.com> wrote:
> 
> I have 2 Spark jobs: one is the pre-process and the second is the process.
> The process job needs to do a calculation for each user in the data.
> I want to avoid a shuffle such as groupBy, so I am thinking about saving the result
> of the pre-process either bucketed by user in Parquet, or repartitioned by user and
> then saved.
> 
> Which is preferable, and why?
> Thanks in advance,
> Oren

