Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Ted Yu
trackStateByKey API is in branch-1.6

FYI
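
For anyone following along, here is a rough sketch of how the timeout described in Todd's quote below might be used, based on the branch-1.6 preview API (trackStateByKey/StateSpec; names and signatures were still in flux at the time, and keyedStream is only a placeholder):

import org.apache.spark.streaming._

// keyedStream: DStream[(String, Int)] is an assumption, not from the original thread
val trackingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  if (state.isTimingOut()) {
    // no new data arrived within the timeout; the system is dropping this state
    (key, state.get())
  } else {
    val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
    state.update(sum)
    (key, sum)
  }
}

val spec = StateSpec.function(trackingFunc).timeout(Minutes(60 * 24))
val tracked = keyedStream.trackStateByKey(spec)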

On Wed, Nov 25, 2015 at 6:03 AM, Todd Nist  wrote:

> Perhaps the new trackStateByKey targeted for version 1.6 may help you here.
> I'm not sure whether it will be part of 1.6, as the JIRA does not specify a
> fix version.  The JIRA describing it is here:
> https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that
> discusses the API changes is here:
>
>
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#
>
> Look for the timeout  function:
>
> /**
>
>   * Set the duration of inactivity (i.e. no new data) after which a state
>
>   * can be terminated by the system. After this idle period, the system
>
>   * will mark the idle state as being timed out, and call the tracking
>
>   * function with State[S].isTimingOut() = true.
>
>   */
>
>  def timeout(duration: Duration): this.type
>
> -Todd
>
> On Wed, Nov 25, 2015 at 8:00 AM, diplomatic Guru  > wrote:
>
>> Hello,
>>
>> I know how I could clear the old state depending on the input value. If
>> some condition matches, indicating that the state is old, then returning
>> null will invalidate the record. But this is only feasible if a new
>> record arrives that matches the old key. What if no new data arrives for
>> that old key? How could I invalidate it then?
>>
>> e.g.
>>
>> A key/Value arrives like this
>>
>> Key 12-11-2015:10:00: Value:test,1,2,12-11-2015:10:00
>>
>> Above key will be updated to state.
>>
>> Every time there is a value for this '12-11-2015:10:00' key, it will be
>> aggregated and updated. If the job runs 24/7, this state will be kept
>> forever until we restart the job. I could have a validation within the
>> updateStateByKey function to check and delete the record if value[3] <
>> SYSTIME-1, but this is only effective if a new record arrives that matches
>> 12-11-2015:10:00 on a later day. What if no new values are received for the
>> key 12-11-2015:10:00? I assume it will remain in the state; am I correct?
>> If so, how do I clear the state?
>>
>> Thank you.
>>
>>
>>
>


data local read counter

2015-11-25 Thread Patcharee Thongtra

Hi,

Is there a counter for data-local reads? I assumed the locality-level
counter would show this, but it seems not to.


Thanks,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Hello,

I have a text file consisting of 483150 lines (wc -l "my_file.txt").

However when I read it using textFile:

%pyspark
rdd = sc.textFile("my_file.txt")
print rdd.count()

it returns 554420 lines. Any idea why this is happening? Is it using a
different newline delimiter, and if so, how can this be changed?

Thank you,
George
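
A likely cause is embedded carriage returns or record delimiters that wc -l does not count. Below is a minimal sketch (not from the original thread) of forcing a specific record delimiter through the Hadoop input format, assuming a Hadoop 2.x build where textinputformat.record.delimiter is honored:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n")   // force LF-only records
val lines = sc.newAPIHadoopFile("my_file.txt", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
println(lines.count())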


[ANNOUNCE] CFP open for ApacheCon North America 2016

2015-11-25 Thread Rich Bowen
Community growth starts by talking with those interested in your
project. ApacheCon North America is coming; are you?

We are delighted to announce that the Call For Presentations (CFP) is
now open for ApacheCon North America. You can submit your proposed
sessions at
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp
for big data talks and
http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp
for all other topics.

ApacheCon North America will be held in Vancouver, Canada, May 9-13th
2016. ApacheCon has been running every year since 2000, and is the place
to build your project communities.

While we will consider individual talks, we prefer to see related
sessions that are likely to draw users and community members. When
submitting your talk, work with your project community and with related
communities to come up with a full program that will walk attendees
through the basics and on into mastery of your project in example use
cases. Content that introduces what's new in your latest release is also
of particular interest, especially when it builds upon existing well-known
application models. The goal should be to showcase your project in
ways that will attract participants and encourage engagement in your
community. Please remember to involve your whole project community (user
and dev lists) when building content. This is your chance to create a
project-specific event within the broader ApacheCon conference.

Content at ApacheCon North America will be cross-promoted as
mini-conferences, such as ApacheCon Big Data, and ApacheCon Mobile, so
be sure to indicate which larger category your proposed sessions fit into.

Finally, please plan to attend ApacheCon, even if you're not proposing a
talk. The biggest value of the event is community building, and we count
on you to make it a place where your project community is likely to
congregate, not just for the technical content in sessions, but for
hackathons, project summits, and good old fashioned face-to-face networking.

-- 
rbo...@apache.org
http://apache.org/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Partial data transfer from one partition to other

2015-11-25 Thread Samarth Rastogi
I have a use case where I partition my data on key 1, but after some time
key 1 can change to key 2.
Now I want all the new data to go to the key 2 partition, and also the old
key 1 data to move to the key 2 partition.
Is there a way to copy partial data from one partition to another partition?
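
A minimal sketch of one way to do this, assuming the data is a pair RDD and that keyMap and numPartitions stand in for the real key mapping and partition count: remap the keys first, then repartition so the old and new records land in the same partition.

import org.apache.spark.HashPartitioner

// keyMap and numPartitions are illustrative placeholders
val keyMap = Map("key1" -> "key2")
val remapped = pairRdd.map { case (k, v) => (keyMap.getOrElse(k, k), v) }
val together = remapped.partitionBy(new HashPartitioner(numPartitions))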

Samarth Rastogi
Developer
T
M
srast...@thunderhead.com


Thunderhead is the trading name of Thunderhead Limited which is registered in 
England under No. 4303041 whose registered office is at Ingeni Building, 17 
Broadwick Street, Soho, London. W1F 0DJ.

The contents of this e-mail are intended for the named addressee only. It 
contains confidential information. Unless you are the named addressee or an 
authorized designee, you may not copy or use it, or disclose it to anyone else. 
If you received it in error please notify us immediately and then destroy it.



JNI native library problem java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
Hi;

I am trying to use a native library inside Spark.

I


Oriol.


Re: Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Ted Yu
In your spark-env, did you set LD_LIBRARY_PATH ?

Cheers
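
As a quick sanity check, here is a small diagnostic sketch (assumed, not from the thread) that prints the library path each executor JVM actually sees; the output lands in the executors' stdout logs rather than on the driver console:

// Hedged diagnostic: inspect java.library.path on the executors
sc.parallelize(1 to 100, 4).foreachPartition { _ =>
  println(System.getProperty("java.library.path"))
}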

On Wed, Nov 25, 2015 at 7:32 AM, Oriol López Massaguer <
oriol.lo...@gmail.com> wrote:

> Hello;
>
> I'm trying to use a native library in Spark.
>
> I was using a simple standalone cluster with one master and worker.
>
> According to the documentation, I edited spark-defaults.conf by setting:
>
> spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
> spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
> spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/
>
> In the path /opt/eTOX_spark/lib/ there are three .so files, which are wrapped
> in org.RDKit.jar.
>
> But when I try to submit a job that uses the native library I get:
>
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Ljava/lang/String;)J
> at org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Native Method)
> at org.RDKit.RWMol.MolFromSmiles(RWMol.java:426)
> at models.spark.sources.eTOX_DB$.main(eTOX.scala:54)
> at models.spark.sources.eTOX_DB.main(eTOX.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> I run spark-submit with the following parameters:
>
>  /opt/spark/bin/spark-submit --verbose --class
> "models.spark.sources.eTOX_DB"  --master
> spark://localhost.localdomain:7077
> target/scala-2.10/etox_spark_2.10-1.0.jar
>
> the full output is:
>
> Using properties file: /opt/spark/conf/spark-defaults.conf
> Adding default property: spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
> Adding default property:
> spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
> Adding default property:
> spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/
> Parsed arguments:
>   master  spark://localhost.localdomain:7077
>   deployMode  null
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  /opt/spark/conf/spark-defaults.conf
>   driverMemorynull
>   driverCores null
>   driverExtraClassPath/opt/eTOX_spark/lib/org.RDKit.jar
>   driverExtraLibraryPath  /opt/eTOX_spark/lib/
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   models.spark.sources.eTOX_DB
>   primaryResource
> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
>   namemodels.spark.sources.eTOX_DB
>   childArgs   []
>   jarsnull
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
>
> Spark properties used, including those specified through
>  --conf and those from the properties file
> /opt/spark/conf/spark-defaults.conf:
>   spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
>   spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
>   spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar
>
>
> Main class:
> models.spark.sources.eTOX_DB
> Arguments:
>
> System properties:
> spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
> spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
> SPARK_SUBMIT -> true
> spark.app.name -> models.spark.sources.eTOX_DB
> spark.jars ->
> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
> spark.submit.deployMode -> client
> spark.master -> spark://localhost.localdomain:7077
> spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar
> Classpath elements:
> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
>
>
> Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so)
> Loading libraries
> Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so, /opt/eTOX_spark/lib/
> libboost_thread.1.48.0.so, /opt/eTOX_spark/lib/libboost_system.1.48.0.so,
> /opt/eTOX_spark/lib/libGraphMolWrap.so)
> Loading libraries
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/11/25 16:27:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
> 15/11/25 16:27:33 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/11/25 16:27:33 

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs replication factor. To fix it without
entirely clearing out HDFS and rebooting, I first ran
hdfs dfs -setrep -R -w 1 /
to reduce all the current files' replication factor to 1 recursively from
the root, then I changed the dfs.replication factor in
ephemeral-hdfs/conf/hdfs-site.xml and ran ephemeral-hdfs/sbin/stop-all.sh
and start-all.sh

Alex

On Tue, Nov 24, 2015 at 10:43 PM, Ye Xianjin  wrote:

> Hi AlexG:
>
> Files (blocks, more specifically) have 3 copies on HDFS by default, so
> 3.8 * 3 = 11.4 TB.
>
> --
> Ye Xianjin
> Sent with Sparrow 
>
> On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
>
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I
> see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
>
> I'm using the spark-ec2 script to launch, with the command
>
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
>
> and am not modifying any configuration files for Hadoop.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
Hello;

I'm trying to use a native library in Spark.

I was using a simple standalone cluster with one master and worker.

According to the documentation, I edited spark-defaults.conf by setting:

spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/

In the path /opt/eTOX_spark/lib/ there are three .so files, which are wrapped
in org.RDKit.jar.

But when I try to submit a job that uses the native library I get:

Exception in thread "main" java.lang.UnsatisfiedLinkError:
org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Ljava/lang/String;)J
at org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Native Method)
at org.RDKit.RWMol.MolFromSmiles(RWMol.java:426)
at models.spark.sources.eTOX_DB$.main(eTOX.scala:54)
at models.spark.sources.eTOX_DB.main(eTOX.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I run spark-submit with the following parameters:

 /opt/spark/bin/spark-submit --verbose --class
"models.spark.sources.eTOX_DB"  --master
spark://localhost.localdomain:7077
target/scala-2.10/etox_spark_2.10-1.0.jar

the full output is:

Using properties file: /opt/spark/conf/spark-defaults.conf
Adding default property: spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
Adding default property:
spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
Adding default property:
spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/
Parsed arguments:
  master  spark://localhost.localdomain:7077
  deployMode  null
  executorMemory  null
  executorCores   null
  totalExecutorCores  null
  propertiesFile  /opt/spark/conf/spark-defaults.conf
  driverMemorynull
  driverCores null
  driverExtraClassPath/opt/eTOX_spark/lib/org.RDKit.jar
  driverExtraLibraryPath  /opt/eTOX_spark/lib/
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   models.spark.sources.eTOX_DB
  primaryResource
file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
  namemodels.spark.sources.eTOX_DB
  childArgs   []
  jarsnull
  packagesnull
  packagesExclusions  null
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file
/opt/spark/conf/spark-defaults.conf:
  spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
  spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
  spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar


Main class:
models.spark.sources.eTOX_DB
Arguments:

System properties:
spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
SPARK_SUBMIT -> true
spark.app.name -> models.spark.sources.eTOX_DB
spark.jars -> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
spark.submit.deployMode -> client
spark.master -> spark://localhost.localdomain:7077
spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar
Classpath elements:
file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar


Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so)
Loading libraries
Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so, /opt/eTOX_spark/lib/
libboost_thread.1.48.0.so, /opt/eTOX_spark/lib/libboost_system.1.48.0.so,
/opt/eTOX_spark/lib/libGraphMolWrap.so)
Loading libraries
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/11/25 16:27:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
15/11/25 16:27:33 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/11/25 16:27:33 WARN Utils: Your hostname, localhost.localdomain resolves
to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface
enp0s3)
15/11/25 16:27:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
15/11/25 16:27:33 INFO SecurityManager: Changing view acls to: user
15/11/25 16:27:33 INFO SecurityManager: Changing modify acls to: user

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node
goes down the data is lost - in case that's not obvious.
On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens  wrote:

> Thanks, the issue was indeed the dfs replication factor. To fix it without
> entirely clearing out HDFS and rebooting, I first ran
> hdfs dfs -setrep -R -w 1 /
> to reduce all the current files' replication factor to 1 recursively from
> the root, then I changed the dfs.replication factor in
> ephemeral-hdfs/conf/hdfs-site.xml and ran ephemeral-hdfs/sbin/stop-all.sh
> and start-all.sh
>
> Alex
>
> On Tue, Nov 24, 2015 at 10:43 PM, Ye Xianjin  wrote:
>
>> Hi AlexG:
>>
>> Files (blocks, more specifically) have 3 copies on HDFS by default, so
>> 3.8 * 3 = 11.4 TB.
>>
>> --
>> Ye Xianjin
>> Sent with Sparrow 
>>
>> On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
>>
>> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
>> cluster
>> with 16.73 Tb storage, using
>> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
>> Nothing else was stored in the HDFS, but after completing the download,
>> the
>> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I
>> see
>> that the dataset only takes up 3.8 Tb as expected. I navigated through the
>> entire HDFS hierarchy from /, and don't see where the missing space is.
>> Any
>> ideas what is going on and how to rectify it?
>>
>> I'm using the spark-ec2 script to launch, with the command
>>
>> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
>> --placement-group=pcavariants --copy-aws-credentials
>> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
>> conversioncluster
>>
>> and am not modifying any configuration files for Hadoop.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>


Re: Spark Streaming idempotent writes to HDFS

2015-11-25 Thread Steve Loughran

On 25 Nov 2015, at 07:01, Michael > 
wrote:

so basically writing them into a temporary directory named with the batch time 
and then moving the files to their destination on success? I wish there were a 
way to skip moving files around and to set the output filenames directly.


That's how everything else does it: it relies on rename() being atomic and O(1) on 
HDFS. Just create the temp dir with the same parent dir as the destination, so 
in encrypted HDFS they are both in the same encryption zone.

And know that renames on S3n/S3A are neither atomic nor O(1), so that is not how 
you commit things there.
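
For reference, a minimal sketch of that batch-time-keyed, write-then-rename pattern; the paths, names, and helper are assumptions, not the poster's code:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Key the output directory on the batch time, write to a sibling temp dir,
// then rename into place; reprocessing an already-written batch becomes a no-op.
def saveBatch(df: DataFrame, batchTimeMs: Long, destRoot: String): Unit = {
  val fs = FileSystem.get(df.sqlContext.sparkContext.hadoopConfiguration)
  val dest = new Path(s"$destRoot/batch=$batchTimeMs")
  if (!fs.exists(dest)) {
    val tmp = new Path(s"$destRoot/_tmp-$batchTimeMs")  // same parent, same encryption zone
    df.write.parquet(tmp.toString)
    fs.rename(tmp, dest)                                // atomic and O(1) on HDFS, not on S3
  }
}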

Thanks Burak :)

-Michael


On Mon, Nov 23, 2015, at 09:19 PM, Burak Yavuz wrote:
Not sure if it would be the most efficient, but maybe you can think of the 
filesystem as a key-value store and write each batch to a sub-directory, where 
the directory name is the batch time. If the directory already exists, then you 
shouldn't write to it. Then you may have a follow-up batch job that coalesces 
files, in order to "close the day".

Burak

On Mon, Nov 23, 2015 at 8:58 PM, Michael 
> wrote:
Hi all,

I'm working on a project with Spark Streaming. The goal is to process log
files from S3 and save them on Hadoop to later analyze them with
Spark SQL.
Everything works well except when I kill the Spark application and
restart it: it picks up from the latest processed batch and reprocesses
it, which results in duplicate data on HDFS.

How can I make the writing step on HDFS idempotent? I couldn't find any
way to control, for example, the filenames of the parquet files being
written; the idea being to include the batch time so that the same batch
always gets written to the same path.
I've also tried mode("overwrite"), but it looks like each batch gets
written to the same file every time.
Any help would be greatly appreciated.

Thanks,
Michael

--

def save_rdd(batch_time, rdd):
    sqlContext = SQLContext(rdd.context)
    df = sqlContext.createDataFrame(rdd, log_schema)
    df.write.mode("append").partitionBy("log_date").parquet(hdfs_dest_directory)

def create_ssc(checkpoint_dir, spark_master):
    sc = SparkContext(spark_master, app_name)
    ssc = StreamingContext(sc, batch_interval)
    ssc.checkpoint(checkpoint_dir)

    # dstream creation omitted in the original message
    parsed = dstream.map(lambda line: log_parser(line))
    parsed.foreachRDD(lambda batch_time, rdd: save_rdd(batch_time, rdd))

    return ssc

ssc = StreamingContext.getOrCreate(checkpoint_dir,
                                   lambda: create_ssc(checkpoint_dir, spark_master))
ssc.start()
ssc.awaitTermination()

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org





Re: How to run two operations on the same RDD simultaneously

2015-11-25 Thread Jay Luan
Ah, thank you so much, this is perfect
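
For reference, a minimal sketch of the accumulator approach suggested below; rdd and transform() stand in for the real data and the real map1 logic:

val counted = sc.accumulator(0L, "records seen in map1")
val mapped = rdd.map { x =>
  counted += 1L        // counted as a side effect of map1; task retries can inflate the total
  transform(x)         // transform() is a placeholder for the actual map1 logic
}
mapped.saveAsTextFile("counted-output")   // the accumulator is only reliable after an action runs
println(counted.value)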

On Fri, Nov 20, 2015 at 3:48 PM, Ali Tajeldin EDU 
wrote:

> You can try to use an Accumulator (
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator)
> to keep count in map1.  Note that the final count may be higher than the
> number of records if there were some retries along the way.
> --
> Ali
>
> On Nov 20, 2015, at 3:38 PM, jluan  wrote:
>
> As far as I understand, operations on rdd's usually come in the form
>
> rdd => map1 => map2 => map2 => (maybe collect)
>
> If I would like to also count my RDD, is there any way I could include this
> at map1? So that as spark runs through map1, it also does a count? Or would
> count need to be a separate operation such that I would have to run through
> my dataset again. My dataset is really memory intensive so I'd rather not
> cache() it if possible.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-two-operations-on-the-same-RDD-simultaneously-tp25441.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Hi,

I am trying to add the row number to a spark dataframe.
This is my dataframe:

scala> df.printSchema
root
|-- line: string (nullable = true)

I tried to use df.withColumn but I am getting below exception.

scala> df.withColumn("row",rowNumber)
org.apache.spark.sql.AnalysisException: unresolved operator 'Project
[line#2326,'row_number() AS row#2327];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

Also, is it possible to add a column from one dataframe to another?
something like

scala> df.withColumn("line2",df2("line"))

org.apache.spark.sql.AnalysisException: resolved attribute(s)
line#2330 missing from line#2326 in operator !Project
[line#2326,line#2330 AS line2#2331];

​

Thanks and Regards,
Vishnu Viswanath
*www.vishnuviswanath.com *


RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Thanks Ted.

 

I have the jar file scala-compiler-2.10.4.jar as well

 

pwd

/

find ./ -name scala-compiler-2.10.4.jar

./usr/lib/spark/build/zinc-0.3.5.3/lib/scala-compiler-2.10.4.jar

./usr/lib/spark/build/apache-maven-3.3.3/lib/scala-compiler-2.10.4.jar

./root/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar

 

It sounds like the problem may be that, because I am running the Maven command
as root, it cannot find that file.

 

Do I need to add it somewhere or set it up on the PATH/CLASSPATH?

 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: 25 November 2015 22:35
To: Mich Talebzadeh 
Cc: user 
Subject: Re: Building Spark without hive libraries

 

bq. [error] Required file not found: scala-compiler-2.10.4.jar

 

Can you search for the above jar ?

 

I found two locally:

 

/home/hbase/.ivy2/cache/org.scala-lang/scala-compiler/jars/scala-compiler-2.10.4.jar

/home/hbase/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar

 

On Wed, Nov 25, 2015 at 2:30 PM, Mich Talebzadeh  > wrote:

Thanks Ted.

 

I ran maven in debug mode as follows

 

build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean 
package > log

Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn

 

Still cannot determine the cause of this error.

 

Thanks,

 

Mich

 


 

From: Ted Yu [mailto:yuzhih...@gmail.com  ] 
Sent: 25 November 2015 21:52
To: Mich Talebzadeh  >
Cc: user  >
Subject: Re: Building Spark without hive libraries

 

Take a look at install_zinc() in build/mvn

 

Cheers

 

On Wed, Nov 25, 2015 at 1:30 PM, Mich Talebzadeh  > wrote:

Hi,

 

I am trying to build Spark from source without using Hive. I am getting 

 

[error] Required file not found: scala-compiler-2.10.4.jar

[error] See zinc -help for information about locating necessary files

 

I have to run this as root otherwise build does not progress. Any help is 
appreciated.

 

 

-bash-3.2#  ./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"

+++ dirname ./make-distribution.sh

++ cd .

++ pwd

+ SPARK_HOME=/usr/lib/spark

+ DISTDIR=/usr/lib/spark/dist

+ SPARK_TACHYON=false

+ TACHYON_VERSION=0.7.1

+ TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz

+ 
TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz

+ MAKE_TGZ=false

+ NAME=none

+ MVN=/usr/lib/spark/build/mvn

+ ((  4  ))

+ case $1 in

+ NAME=hadoop2-without-hive

+ shift

+ shift

+ ((  2  ))

+ case $1 in

+ MAKE_TGZ=true

+ shift

+ ((  1  ))

+ case $1 in

+ break

+ '[' -z /usr/java/latest ']'

+ '[' -z /usr/java/latest ']'

++ command -v git

+ '[' ']'

++ command -v /usr/lib/spark/build/mvn

+ '[' '!' /usr/lib/spark/build/mvn ']'

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.version 
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ VERSION=1.5.2

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=scala.binary.version 
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ SCALA_VERSION=2.10

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=hadoop.version 
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ SPARK_HADOOP_VERSION=2.6.0

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.activeProfiles 
-pl sql/hive -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ fgrep --count 'hive'

++ echo -n

+ SPARK_HIVE=0

+ '[' hadoop2-without-hive == none ']'

+ echo 'Spark version is 

Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
bq. I have to run this as root otherwise build does not progress

I build Spark as a non-root user and don't have this problem.

I suggest you dig a little to see what is stalling the build when run as a
non-root user.

On Wed, Nov 25, 2015 at 2:48 PM, Mich Talebzadeh 
wrote:

> Thanks Ted.
>
>
>
> I have the jar file scala-compiler-2.10.4.jar as well
>
>
>
> pwd
>
> /
>
> *find ./ -name scala-compiler-2.10.4.jar*
>
> ./usr/lib/spark/build/zinc-0.3.5.3/lib/scala-compiler-2.10.4.jar
>
> ./usr/lib/spark/build/apache-maven-3.3.3/lib/scala-compiler-2.10.4.jar
>
>
> ./root/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
>
>
>
> Sounds like (?) because I am running the maven command as root, it cannot
> find that file!!
>
>
>
> Do I need to add it somewhere or set it up on the PATH/CLASSPATH?
>
>
>
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* 25 November 2015 22:35
>
> *To:* Mich Talebzadeh 
> *Cc:* user 
> *Subject:* Re: Building Spark without hive libraries
>
>
>
> bq. [error] Required file not found: scala-compiler-2.10.4.jar
>
>
>
> Can you search for the above jar ?
>
>
>
> I found two locally:
>
>
>
>
> /home/hbase/.ivy2/cache/org.scala-lang/scala-compiler/jars/scala-compiler-2.10.4.jar
>
>
> /home/hbase/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
>
>
>
> On Wed, Nov 25, 2015 at 2:30 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Ted.
>
>
>
> I ran maven in debug mode as follows
>
>
>
> *build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package > log*
>
> Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn
>
>
>
> Still cannot determine the cause of this error.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* 25 November 2015 21:52
> *To:* Mich Talebzadeh 
> *Cc:* user 
> *Subject:* Re: Building Spark without hive libraries
>
>
>
> Take a look at install_zinc() in build/mvn
>
>
>
> Cheers
>
>
>
> On Wed, Nov 25, 2015 at 1:30 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
>
>
> I am trying to build sparc from the source and not using Hive. I am
> getting
>
>
>
> [error] Required file not found: scala-compiler-2.10.4.jar
>
> [error] See zinc -help for information about locating necessary files
>
>
>
> I have to run this as root otherwise build does not progress. Any help is
> appreciated.
>
>
>
>
>
> -bash-3.2#  ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
>
> +++ dirname ./make-distribution.sh
>
> ++ cd .
>
> ++ pwd
>
> + SPARK_HOME=/usr/lib/spark
>
> + DISTDIR=/usr/lib/spark/dist
>
> + SPARK_TACHYON=false
>
> + TACHYON_VERSION=0.7.1
>
> + TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz
>
> + TACHYON_URL=
> https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz
>
> + MAKE_TGZ=false
>
> + NAME=none
>
> + MVN=/usr/lib/spark/build/mvn
>
> + ((  4  ))
>
> + case $1 in
>
> + NAME=hadoop2-without-hive
>
> + shift
>
> + shift
>
> + ((  2  ))
>
> + case $1 in
>
> + MAKE_TGZ=true
>
> + shift
>
> + ((  1  ))
>
> + case $1 in
>
> + break
>
> + '[' -z /usr/java/latest ']'
>
> + '[' -z /usr/java/latest ']'
>
> ++ command -v git
>
> + '[' ']'
>
> ++ command -v /usr/lib/spark/build/mvn
>
> + '[' '!' /usr/lib/spark/build/mvn ']'
>
> ++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + VERSION=1.5.2
>
> ++ /usr/lib/spark/build/mvn help:evaluate
> -Dexpression=scala.binary.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + SCALA_VERSION=2.10
>
> ++ 

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Looks better now, Ted :)

 

[INFO] 

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [ 39.937 s]

[INFO] Spark Project Launcher . SUCCESS [ 44.718 s]

[INFO] Spark Project Networking ... SUCCESS [ 11.294 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  4.720 s]

[INFO] Spark Project Unsafe ... SUCCESS [ 10.705 s]

[INFO] Spark Project Core . SUCCESS [02:52 min]

[INFO] Spark Project Bagel  SUCCESS [  5.937 s]

[INFO] Spark Project GraphX ... SUCCESS [ 15.977 s]

[INFO] Spark Project Streaming  SUCCESS [ 36.453 s]

[INFO] Spark Project Catalyst . SUCCESS [ 54.381 s]

[INFO] Spark Project SQL .. SUCCESS [01:07 min]

[INFO] Spark Project ML Library ... SUCCESS [01:22 min]

[INFO] Spark Project Tools  SUCCESS [  2.493 s]

[INFO] Spark Project Hive . SUCCESS [ 58.496 s]

[INFO] Spark Project REPL . SUCCESS [  9.278 s]

[INFO] Spark Project YARN . SUCCESS [ 12.424 s]

[INFO] Spark Project Assembly . SUCCESS [01:51 min]

[INFO] Spark Project External Twitter . SUCCESS [  7.604 s]

[INFO] Spark Project External Flume Sink .. SUCCESS [  7.580 s]

[INFO] Spark Project External Flume ... SUCCESS [  9.526 s]

[INFO] Spark Project External Flume Assembly .. SUCCESS [  3.163 s]

[INFO] Spark Project External MQTT  SUCCESS [ 31.774 s]

[INFO] Spark Project External MQTT Assembly ... SUCCESS [  8.698 s]

[INFO] Spark Project External ZeroMQ .. SUCCESS [  6.992 s]

[INFO] Spark Project External Kafka ... SUCCESS [ 11.487 s]

[INFO] Spark Project Examples . SUCCESS [02:12 min]

[INFO] Spark Project External Kafka Assembly .. SUCCESS [  9.046 s]

[INFO] Spark Project YARN Shuffle Service . SUCCESS [  6.097 s]

[INFO] 

[INFO] BUILD SUCCESS

[INFO] 

[INFO] Total time: 16:16 min

[INFO] Finished at: 2015-11-25T23:34:35+00:00

[INFO] Final Memory: 90M/1312M

[INFO] 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 25 November 2015 23:08
To: 'Ted Yu' 
Cc: 'user' 
Subject: RE: Building Spark without hive libraries

 

Yep.

 

The user hduser was using the wrong version of maven

 

hduser@rhes564::/usr/lib/spark> build/mvn -X -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -DskipTests clean package > log

Using `mvn` from path: /usr/local/apache-maven/apache-maven-3.3.1/bin/mvn

 

 

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed 
with message:

Detected Maven Version: 3.3.1 is not in the allowed range 3.3.3.

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase 

Re: Adding new column to Dataframe

2015-11-25 Thread Jeff Zhang
>>> I tried to use df.withColumn but I am getting below exception.

What is rowNumber here? A UDF? You can use monotonicallyIncreasingId
to generate an id.

>>> Also, is it possible to add a column from one dataframe to another?

You can't, because how would you add a column from one dataframe to another
when they may have different numbers of rows? You'd be better off using a
join to correlate the two data frames.
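
If the two frames really are positionally aligned, one hedged way to line them up is to derive an explicit 0..n-1 row id with zipWithIndex and join on it; a sketch assuming the Spark 1.4/1.5 DataFrame APIs:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a row-id column to a DataFrame via the underlying RDD
def withRowId(df: org.apache.spark.sql.DataFrame, col: String) = {
  val rows = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sqlContext.createDataFrame(rows,
    StructType(df.schema.fields :+ StructField(col, LongType, nullable = false)))
}

val joined = withRowId(df, "rid").join(withRowId(df2, "rid"), "rid")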

On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Hi,
>
> I am trying to add the row number to a spark dataframe.
> This is my dataframe:
>
> scala> df.printSchema
> root
> |-- line: string (nullable = true)
>
> I tried to use df.withColumn but I am getting below exception.
>
> scala> df.withColumn("row",rowNumber)
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> [line#2326,'row_number() AS row#2327];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>
> Also, is it possible to add a column from one dataframe to another?
> something like
>
> scala> df.withColumn("line2",df2("line"))
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
> missing from line#2326 in operator !Project [line#2326,line#2330 AS 
> line2#2331];
>
> ​
>
> Thanks and Regards,
> Vishnu Viswanath
> *www.vishnuviswanath.com *
>



-- 
Best Regards

Jeff Zhang


Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
bq. [error] Required file not found: scala-compiler-2.10.4.jar

Can you search for the above jar ?

I found two locally:

/home/hbase/.ivy2/cache/org.scala-lang/scala-compiler/jars/scala-compiler-2.10.4.jar
/home/hbase/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar

On Wed, Nov 25, 2015 at 2:30 PM, Mich Talebzadeh 
wrote:

> Thanks Ted.
>
>
>
> I ran maven in debug mode as follows
>
>
>
> *build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package > log*
>
> Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn
>
>
>
> Still cannot determine the cause of this error.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* 25 November 2015 21:52
> *To:* Mich Talebzadeh 
> *Cc:* user 
> *Subject:* Re: Building Spark without hive libraries
>
>
>
> Take a look at install_zinc() in build/mvn
>
>
>
> Cheers
>
>
>
> On Wed, Nov 25, 2015 at 1:30 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
>
>
> I am trying to build sparc from the source and not using Hive. I am
> getting
>
>
>
> [error] Required file not found: scala-compiler-2.10.4.jar
>
> [error] See zinc -help for information about locating necessary files
>
>
>
> I have to run this as root otherwise build does not progress. Any help is
> appreciated.
>
>
>
>
>
> -bash-3.2#  ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
>
> +++ dirname ./make-distribution.sh
>
> ++ cd .
>
> ++ pwd
>
> + SPARK_HOME=/usr/lib/spark
>
> + DISTDIR=/usr/lib/spark/dist
>
> + SPARK_TACHYON=false
>
> + TACHYON_VERSION=0.7.1
>
> + TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz
>
> + TACHYON_URL=
> https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz
>
> + MAKE_TGZ=false
>
> + NAME=none
>
> + MVN=/usr/lib/spark/build/mvn
>
> + ((  4  ))
>
> + case $1 in
>
> + NAME=hadoop2-without-hive
>
> + shift
>
> + shift
>
> + ((  2  ))
>
> + case $1 in
>
> + MAKE_TGZ=true
>
> + shift
>
> + ((  1  ))
>
> + case $1 in
>
> + break
>
> + '[' -z /usr/java/latest ']'
>
> + '[' -z /usr/java/latest ']'
>
> ++ command -v git
>
> + '[' ']'
>
> ++ command -v /usr/lib/spark/build/mvn
>
> + '[' '!' /usr/lib/spark/build/mvn ']'
>
> ++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + VERSION=1.5.2
>
> ++ /usr/lib/spark/build/mvn help:evaluate
> -Dexpression=scala.binary.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + SCALA_VERSION=2.10
>
> ++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=hadoop.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + SPARK_HADOOP_VERSION=2.6.0
>
> ++ /usr/lib/spark/build/mvn help:evaluate
> -Dexpression=project.activeProfiles -pl sql/hive
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ fgrep --count 'hive'
>
> ++ echo -n
>
> + SPARK_HIVE=0
>
> + '[' hadoop2-without-hive == none ']'
>
> + echo 'Spark version is 1.5.2'
>
> Spark version is 1.5.2
>
> + '[' true == true ']'
>
> + echo 'Making spark-1.5.2-bin-hadoop2-without-hive.tgz'
>
> Making spark-1.5.2-bin-hadoop2-without-hive.tgz
>
> + '[' false == true ']'
>
> + echo 'Tachyon Disabled'
>
> Tachyon Disabled
>
> + cd /usr/lib/spark
>
> + export 'MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M
> -XX:ReservedCodeCacheSize=512m'
>
> + MAVEN_OPTS='-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m'
>
> + BUILD_COMMAND=("$MVN" clean package -DskipTests $@)
>
> + echo -e '\nBuilding with...'
>
>
>
> Building with...
>
> + echo -e '$ /usr/lib/spark/build/mvn' clean package -DskipTests
> '-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided\n'
>
> $ /usr/lib/spark/build/mvn clean package -DskipTests
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
>
>
> + /usr/lib/spark/build/mvn clean package -DskipTests
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn
>
> [INFO] Scanning for projects...
>
> [INFO]
> 
>
> [INFO] Reactor Build Order:
>
> [INFO]
>
> [INFO] Spark Project Parent POM
>
> 

Error in block pushing thread puts the KinesisReceiver in a stuck state

2015-11-25 Thread Spark Newbie
Hi Spark users,

I have been seeing an issue where receivers enter a "stuck" state after
encountering the following exception: "Error in block pushing thread -
java.util.concurrent.TimeoutException: Futures timed out".
I am running the application on Spark 1.4.1 and using kinesis-asl 1.4.

When this happens, the observation is that the
Kinesis.ProcessTask.shard.MillisBehindLatest metric is no longer
published to CloudWatch, which indicates that the workers associated with
the receiver are no longer checkpointing for the shards they were reading
from.

This seems like a bug in the BlockGenerator code, here:
https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala#L171
When pushBlock encounters an exception, in this case the TimeoutException,
it stops pushing blocks. Is this really the expected behavior?

Has anyone else seen this error and have you also seen the issue where
receivers stop receiving records? I'm also trying to find the root cause
for the TimeoutException. If anyone has an idea on this please share.

Thanks,

Bharath


RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Yep.

 

The user hduser was using the wrong version of maven

 

hduser@rhes564::/usr/lib/spark> build/mvn -X -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -DskipTests clean package > log

Using `mvn` from path: /usr/local/apache-maven/apache-maven-3.3.1/bin/mvn

 

 

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed 
with message:

Detected Maven Version: 3.3.1 is not in the allowed range 3.3.3.

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: 25 November 2015 22:54
To: Mich Talebzadeh 
Cc: user 
Subject: Re: Building Spark without hive libraries

 

bq. I have to run this as root otherwise build does not progress

 

I build Spark as a non-root user and don't have this problem.

 

I suggest you dig a little to see what is stalling the build when run as a
non-root user.

 

On Wed, Nov 25, 2015 at 2:48 PM, Mich Talebzadeh  > wrote:

Thanks Ted.

 

I have the jar file scala-compiler-2.10.4.jar as well

 

pwd

/

find ./ -name scala-compiler-2.10.4.jar

./usr/lib/spark/build/zinc-0.3.5.3/lib/scala-compiler-2.10.4.jar

./usr/lib/spark/build/apache-maven-3.3.3/lib/scala-compiler-2.10.4.jar

./root/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar

 

Sounds like (?) because I am running the maven command as root, it cannot find 
that file!!

 

Do I need to add it somewhere or set it up on the PATH/CLASSPATH?

 

 


 

From: Ted Yu [mailto:yuzhih...@gmail.com  ] 
Sent: 25 November 2015 22:35


To: Mich Talebzadeh  >
Cc: user  >
Subject: Re: Building Spark without hive libraries

 

bq. [error] Required file not found: scala-compiler-2.10.4.jar

 

Can you search for the above jar ?

 

I found two locally:

 

/home/hbase/.ivy2/cache/org.scala-lang/scala-compiler/jars/scala-compiler-2.10.4.jar

/home/hbase/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar

 

On Wed, Nov 25, 2015 at 2:30 PM, Mich Talebzadeh  > wrote:

Thanks Ted.

 

I ran maven in debug mode as follows

 

build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean 
package > log

Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn

 

Still cannot determine the cause of this error.

 

Thanks,

 

Mich

 


 

From: Ted Yu [mailto:yuzhih...@gmail.com  ] 
Sent: 25 November 2015 21:52
To: Mich Talebzadeh 

SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Hi Spark users,

I'm seeing the below exception once in a while, which causes tasks to fail
(even after retries, so I think it is a non-recoverable exception); the
stage then fails and the job gets aborted.

Exception ---
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_10_piece0 of broadcast_10

Any idea why this exception occurs and how to avoid/handle these
exceptions? Please let me know if you have seen this exception and know a
fix for it.

Thanks,
Bharath


Re: SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Ted Yu
Which Spark release are you using ?

Please take a look at:
https://issues.apache.org/jira/browse/SPARK-5594

Cheers

On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie 
wrote:

> Hi Spark users,
>
> I'm seeing the below exceptions once in a while which causes tasks to fail
> (even after retries, so it is a non recoverable exception I think), hence
> stage fails and then the job gets aborted.
>
> Exception ---
> java.io.IOException: org.apache.spark.SparkException: Failed to get
> broadcast_10_piece0 of broadcast_10
>
> Any idea why this exception occurs and how to avoid/handle these
> exceptions? Please let me know if you have seen this exception and know a
> fix for it.
>
> Thanks,
> Bharath
>


Re: SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Using Spark-1.4.1

On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu  wrote:

> Which Spark release are you using ?
>
> Please take a look at:
> https://issues.apache.org/jira/browse/SPARK-5594
>
> Cheers
>
> On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie 
> wrote:
>
>> Hi Spark users,
>>
>> I'm seeing the below exceptions once in a while which causes tasks to
>> fail (even after retries, so it is a non recoverable exception I think),
>> hence stage fails and then the job gets aborted.
>>
>> Exception ---
>> java.io.IOException: org.apache.spark.SparkException: Failed to get
>> broadcast_10_piece0 of broadcast_10
>>
>> Any idea why this exception occurs and how to avoid/handle these
>> exceptions? Please let me know if you have seen this exception and know a
>> fix for it.
>>
>> Thanks,
>> Bharath
>>
>
>


Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Jeff,

rowNumber is a function in org.apache.spark.sql.functions.

I will try to use monotonicallyIncreasingId and see if it works.

"You'd better use join to correlate the 2 data frames": Yes, that's why I
thought of adding a row number to both DataFrames and joining them on the
row number. Is there any better way of doing this? Both DataFrames will
always have the same number of rows, but they are not related by any column
on which to join.

Thanks and Regards,
Vishnu Viswanath
​

On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang  wrote:

> >>> I tried to use df.withColumn but I am getting below exception.
>
> What is rowNumber here ? UDF ?  You can use monotonicallyIncreasingId
> for generating id
>
> >>> Also, is it possible to add a column from one dataframe to another?
>
> You can't, because how can you add one dataframe to another if they have
> different number of rows. You'd better to use join to correlate 2 data
> frames.
>
> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to add the row number to a spark dataframe.
>> This is my dataframe:
>>
>> scala> df.printSchema
>> root
>> |-- line: string (nullable = true)
>>
>> I tried to use df.withColumn but I am getting below exception.
>>
>> scala> df.withColumn("row",rowNumber)
>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
>> [line#2326,'row_number() AS row#2327];
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>
>> Also, is it possible to add a column from one dataframe to another?
>> something like
>>
>> scala> df.withColumn("line2",df2("line"))
>>
>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
>> missing from line#2326 in operator !Project [line#2326,line#2330 AS 
>> line2#2331];
>>
>> ​
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>> *www.vishnuviswanath.com *
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Thanks and Regards,
Vishnu Viswanath
+1 309 550 2311
*www.vishnuviswanath.com *


Re: Adding more slaves to a running cluster

2015-11-25 Thread Andy Davidson
Hi Dillian and Nicholas

If you figure out how to do this please post your recipe. It would be very
useful 

andy

From:  Nicholas Chammas 
Date:  Wednesday, November 25, 2015 at 11:36 AM
To:  Dillian Murphey , "user @spark"

Subject:  Re: Adding more slaves to a running cluster

> spark-ec2 does not directly support adding instances to an existing cluster,
> apart from the special case of adding slaves to a cluster with a master but no
> slaves. There is an open issue to track adding this support, SPARK-2008
>  , but it doesn't have any
> momentum at the moment.
> 
> Your best bet currently is to do what you did and hack your way through using
> spark-ec2's various scripts.
> 
> You probably already know this, but to be clear, note that Spark itself
> supports adding slaves to a running cluster. It's just that spark-ec2 hasn't
> implemented a feature to do this work for you.
> 
> Nick
> 
> On Wed, Nov 25, 2015 at 2:27 PM Dillian Murphey 
> wrote:
>> It appears start-slave.sh works on a running cluster.  I'm surprised I can't
>> find more info on this. Maybe I'm not looking hard enough?
>> 
>> Using AWS and spot instances is incredibly more efficient, which begs for the
>> need of dynamically adding more nodes while the cluster is up, yet everything
>> I've found so far seems to indicate it isn't supported yet.
>> 
>> But yet here I am with 1.5 and it at least appears to be working. Am I
>> missing something?
>> 
>> On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
>> wrote:
>>> What's the current status on adding slaves to a running cluster?  I want to
>>> leverage spark-ec2 and autoscaling groups.  I want to launch slaves as spot
>>> instances when I need to do some heavy lifting, but I don't want to bring
>>> down my cluster in order to add nodes.
>>> 
>>> Can this be done by just running start-slave.sh??
>>> 
>>> What about using Mesos?
>>> 
>>> I just want to create an AMI for a slave and on some trigger launch it and
>>> have it automatically add itself to the cluster.
>>> 
>>> thanks
>> 




Obtaining Job Id for query submitted via Spark Thrift Server

2015-11-25 Thread Jagrut Sharma
Is there a way to get the Job Id for a query submitted via the Spark Thrift
Server? This would allow checking the status of that specific job via the
History Server.

Currently, I'm getting status of all jobs, and then filtering the results.
Looking for a more efficient approach.

Test environment is: Spark 1.4.1, Hive 0.13.1, running on YARN

Thanks!
-- 
Jagrut


Re: Issue with spark on hive

2015-11-25 Thread rugalcrimson
Not yet, but I found that it works well in Spark 1.4.1.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-spark-on-hive-tp25372p25485.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Ted,

It looks like I cannot use row_number then. I tried to run a sample window
function and got the error below:
org.apache.spark.sql.AnalysisException: Could not resolve window function
'avg'. Note that, using window functions currently requires a HiveContext;
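
For reference, a minimal sketch of the workaround (assuming Spark 1.5.x with
the spark-hive module on the classpath); the tiny DataFrame below just stands
in for the single-column one in this thread:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val hiveCtx = new HiveContext(sc)   // window functions need a HiveContext in 1.5
import hiveCtx.implicits._

val df = sc.parallelize(Seq("a", "b", "c")).map(Tuple1.apply).toDF("line")

// With no partitioning column, the whole DataFrame becomes one window partition.
val w = Window.orderBy("line")
df.withColumn("row", rowNumber().over(w)).show()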

On Wed, Nov 25, 2015 at 8:28 PM, Ted Yu  wrote:

Vishnu:
> rowNumber (deprecated, replaced with row_number) is a window function.
>
>* Window function: returns a sequential number starting at 1 within a
> window partition.
>*
>* @group window_funcs
>* @since 1.6.0
>*/
>   def row_number(): Column = withExpr {
> UnresolvedWindowFunction("row_number", Nil) }
>
> Sample usage:
>
> df =  sqlContext.range(1<<20)
> df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B'))
> ws = Window.partitionBy(df2.A).orderBy(df2.B)
> df3 = df2.select("client", "date",
> rowNumber().over(ws).alias("rn")).filter("rn < 0")
>
> Cheers
>
> On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks Jeff,
>>
>> rowNumber is a function in org.apache.spark.sql.functions link
>> 
>>
>> I will try to use monotonicallyIncreasingId and see if it works.
>>
>> You’d better to use join to correlate 2 data frames : Yes, thats why I
>> thought of adding row number in both the DataFrames and join them based on
>> row number. Is there any better way of doing this? Both DataFrames will
>> have same number of rows always, but are not related by any column to do
>> join.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>> ​
>>
>> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang  wrote:
>>
>>> >>> I tried to use df.withColumn but I am getting below exception.
>>>
>>> What is rowNumber here ? UDF ?  You can use monotonicallyIncreasingId
>>> for generating id
>>>
>>> >>> Also, is it possible to add a column from one dataframe to another?
>>>
>>> You can't, because how can you add one dataframe to another if they have
>>> different number of rows. You'd better to use join to correlate 2 data
>>> frames.
>>>
>>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
 Hi,

 I am trying to add the row number to a spark dataframe.
 This is my dataframe:

 scala> df.printSchema
 root
 |-- line: string (nullable = true)

 I tried to use df.withColumn but I am getting below exception.

 scala> df.withColumn("row",rowNumber)
 org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
 [line#2326,'row_number() AS row#2327];
 at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
 at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
 at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

 Also, is it possible to add a column from one dataframe to another?
 something like

 scala> df.withColumn("line2",df2("line"))

 org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
 missing from line#2326 in operator !Project [line#2326,line#2330 AS 
 line2#2331];

 ​

 Thanks and Regards,
 Vishnu Viswanath
 *www.vishnuviswanath.com *

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath
>> +1 309 550 2311
>> *www.vishnuviswanath.com *
>>
>
> ​
-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com *


send this email to unsubscribe

2015-11-25 Thread ngocan211 .



Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Tathagata Das
What do you mean by killing the streaming job using UI? Do you mean that
you are clicking the "kill" link in the Jobs page in the Spark UI?

Also in the application, is the main thread waiting on
streamingContext.awaitTermination()? That is designed to catch exceptions
in running job and throw it in the main thread, so that the java program
exits with an exception and non-zero exit code.
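
For reference, a minimal sketch of that pattern (class and app names are
illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    // ... build and act on DStreams here ...
    ssc.start()
    // Rethrows job failures in the main thread, so the driver exits non-zero
    // and a supervised driver can be restarted by the master.
    ssc.awaitTermination()
  }
}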




On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> I am killing my Streaming job using UI. What error code does UI provide if
> the job is killed from there?
>
> On Wed, Nov 25, 2015 at 11:01 AM, Kay-Uwe Moosheimer 
> wrote:
>
>> Testet with Spark 1.5.2 … Works perfect when exit code is non-zero.
>> And does not Restart with exit code equals zero.
>>
>>
>> From: Prem Sure 
>> Date: Wednesday, 25 November 2015 19:57
>> To: SRK 
>> Cc: 
>> Subject: Re: Automatic driver restart does not seem to be working in
>> Spark Standalone
>>
>> I think automatic driver restart will happen, if driver fails with
>> non-zero exit code.
>>
>>   --deploy-mode cluster
>>   --supervise
>>
>>
>>
>> On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> I am submitting my Spark job with supervise option as shown below. When I
>>> kill the driver and the app from UI, the driver does not restart
>>> automatically. This is in a cluster mode.  Any suggestion on how to make
>>> Automatic Driver Restart work would be of great help.
>>>
>>> --supervise
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Adding new column to Dataframe

2015-11-25 Thread Ted Yu
Vishnu:
rowNumber (deprecated, replaced with row_number) is a window function.

   * Window function: returns a sequential number starting at 1 within a
window partition.
   *
   * @group window_funcs
   * @since 1.6.0
   */
  def row_number(): Column = withExpr {
UnresolvedWindowFunction("row_number", Nil) }

Sample usage:

from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber   # renamed row_number() from 1.6 on

df = sqlContext.range(1 << 20)
df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias("B"))
ws = Window.partitionBy(df2.A).orderBy(df2.B)
df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn <= 3")  # first 3 rows per group

Cheers

On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Thanks Jeff,
>
> rowNumber is a function in org.apache.spark.sql.functions link
> 
>
> I will try to use monotonicallyIncreasingId and see if it works.
>
> You’d better to use join to correlate 2 data frames : Yes, thats why I
> thought of adding row number in both the DataFrames and join them based on
> row number. Is there any better way of doing this? Both DataFrames will
> have same number of rows always, but are not related by any column to do
> join.
>
> Thanks and Regards,
> Vishnu Viswanath
> ​
>
> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang  wrote:
>
>> >>> I tried to use df.withColumn but I am getting below exception.
>>
>> What is rowNumber here ? UDF ?  You can use monotonicallyIncreasingId
>> for generating id
>>
>> >>> Also, is it possible to add a column from one dataframe to another?
>>
>> You can't, because how can you add one dataframe to another if they have
>> different number of rows. You'd better to use join to correlate 2 data
>> frames.
>>
>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to add the row number to a spark dataframe.
>>> This is my dataframe:
>>>
>>> scala> df.printSchema
>>> root
>>> |-- line: string (nullable = true)
>>>
>>> I tried to use df.withColumn but I am getting below exception.
>>>
>>> scala> df.withColumn("row",rowNumber)
>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
>>> [line#2326,'row_number() AS row#2327];
>>> at 
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>>> at 
>>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>> at 
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>>> at 
>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>>
>>> Also, is it possible to add a column from one dataframe to another?
>>> something like
>>>
>>> scala> df.withColumn("line2",df2("line"))
>>>
>>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
>>> missing from line#2326 in operator !Project [line#2326,line#2330 AS 
>>> line2#2331];
>>>
>>> ​
>>>
>>> Thanks and Regards,
>>> Vishnu Viswanath
>>> *www.vishnuviswanath.com *
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Thanks and Regards,
> Vishnu Viswanath
> +1 309 550 2311
> *www.vishnuviswanath.com *
>


Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Kay-Uwe Moosheimer
Tested with Spark 1.5.2 … Works perfectly when the exit code is non-zero.
And does not restart when the exit code equals zero.


From:  Prem Sure 
Date:  Wednesday, 25 November 2015 19:57
To:  SRK 
Cc:  
Subject:  Re: Automatic driver restart does not seem to be working in Spark
Standalone

I think automatic driver restart will happen, if driver fails with non-zero
exit code.
  --deploy-mode cluster
  --supervise


On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:
> Hi,
> 
> I am submitting my Spark job with supervise option as shown below. When I
> kill the driver and the app from UI, the driver does not restart
> automatically. This is in a cluster mode.  Any suggestion on how to make
> Automatic Driver Restart work would be of great help.
> 
> --supervise
> 
> 
> Thanks,
> Swetha
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-d
> oes-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Queue in Spark standalone mode

2015-11-25 Thread sunil m
Hi!

I am using Spark 1.5.1 and pretty new to Spark...

Like Yarn, is there a way to configure queues in Spark standalone mode?
If yes, can someone point me to a good documentation / reference.

Sometimes  I get strange behavior while running  multiple jobs
simultaneously.

Thanks in advance.

Warm regards,
Sunil


Re: sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Found the problem. Control-M characters. Please ignore the post

On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos 
wrote:

> Hello,
>
> I have a text file consisting of 483150 lines (wc -l "my_file.txt").
>
> However when I read it using textFile:
>
> %pyspark
> rdd = sc.textFile("my_file.txt")
> print rdd.count()
>
> it returns 554420 lines. Any idea why this is happening? Is it using a
> different new line delimiter and how this can be changed?
>
> Thank you,
> George
>
>
>
>
>


Re: Adding more slaves to a running cluster

2015-11-25 Thread Dillian Murphey
It appears start-slave.sh works on a running cluster.  I'm surprised I
can't find more info on this. Maybe I'm not looking hard enough?

Using AWS and spot instances is incredibly more efficient, which begs for
the need of dynamically adding more nodes while the cluster is up, yet
everything I've found so far seems to indicate it isn't supported yet.

But yet here I am with 1.5 and it at least appears to be working. Am I
missing something?

On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
wrote:

> What's the current status on adding slaves to a running cluster?  I want
> to leverage spark-ec2 and autoscaling groups.  I want to launch slaves as
> spot instances when I need to do some heavy lifting, but I don't want to
> bring down my cluster in order to add nodes.
>
> Can this be done by just running start-slave.sh??
>
> What about using Mesos?
>
> I just want to create an AMI for a slave and on some trigger launch it and
> have it automatically add itself to the cluster.
>
> thanks
>


Re: Adding more slaves to a running cluster

2015-11-25 Thread Nicholas Chammas
spark-ec2 does not directly support adding instances to an existing
cluster, apart from the special case of adding slaves to a cluster with a
master but no slaves. There is an open issue to track adding this support,
SPARK-2008 , but it
doesn't have any momentum at the moment.

Your best bet currently is to do what you did and hack your way through
using spark-ec2's various scripts.

You probably already know this, but to be clear, note that Spark itself
supports adding slaves to a running cluster. It's just that spark-ec2
hasn't implemented a feature to do this work for you.

Nick

On Wed, Nov 25, 2015 at 2:27 PM Dillian Murphey 
wrote:

> It appears start-slave.sh works on a running cluster.  I'm surprised I
> can't find more info on this. Maybe I'm not looking hard enough?
>
> Using AWS and spot instances is incredibly more efficient, which begs for
> the need of dynamically adding more nodes while the cluster is up, yet
> everything I've found so far seems to indicate it isn't supported yet.
>
> But yet here I am with 1.5 and it at least appears to be working. Am I
> missing something?
>
> On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
> wrote:
>
>> What's the current status on adding slaves to a running cluster?  I want
>> to leverage spark-ec2 and autoscaling groups.  I want to launch slaves as
>> spot instances when I need to do some heavy lifting, but I don't want to
>> bring down my cluster in order to add nodes.
>>
>> Can this be done by just running start-slave.sh??
>>
>> What about using Mesos?
>>
>> I just want to create an AMI for a slave and on some trigger launch it
>> and have it automatically add itself to the cluster.
>>
>> thanks
>>
>
>


Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread SRK
Hi,

I am submitting my Spark job with supervise option as shown below. When I
kill the driver and the app from UI, the driver does not restart
automatically. This is in a cluster mode.  Any suggestion on how to make
Automatic Driver Restart work would be of great help.

--supervise 


Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Prem Sure
I think automatic driver restart will happen, if driver fails with non-zero
exit code.

  --deploy-mode cluster
  --supervise



On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:

> Hi,
>
> I am submitting my Spark job with supervise option as shown below. When I
> kill the driver and the app from UI, the driver does not restart
> automatically. This is in a cluster mode.  Any suggestion on how to make
> Automatic Driver Restart work would be of great help.
>
> --supervise
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
Take a look at install_zinc() in build/mvn

Cheers

On Wed, Nov 25, 2015 at 1:30 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> I am trying to build sparc from the source and not using Hive. I am
> getting
>
>
>
> [error] Required file not found: scala-compiler-2.10.4.jar
>
> [error] See zinc -help for information about locating necessary files
>
>
>
> I have to run this as root otherwise build does not progress. Any help is
> appreciated.
>
>
>
>
>
> -bash-3.2#  ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
>
> +++ dirname ./make-distribution.sh
>
> ++ cd .
>
> ++ pwd
>
> + SPARK_HOME=/usr/lib/spark
>
> + DISTDIR=/usr/lib/spark/dist
>
> + SPARK_TACHYON=false
>
> + TACHYON_VERSION=0.7.1
>
> + TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz
>
> + TACHYON_URL=
> https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz
>
> + MAKE_TGZ=false
>
> + NAME=none
>
> + MVN=/usr/lib/spark/build/mvn
>
> + ((  4  ))
>
> + case $1 in
>
> + NAME=hadoop2-without-hive
>
> + shift
>
> + shift
>
> + ((  2  ))
>
> + case $1 in
>
> + MAKE_TGZ=true
>
> + shift
>
> + ((  1  ))
>
> + case $1 in
>
> + break
>
> + '[' -z /usr/java/latest ']'
>
> + '[' -z /usr/java/latest ']'
>
> ++ command -v git
>
> + '[' ']'
>
> ++ command -v /usr/lib/spark/build/mvn
>
> + '[' '!' /usr/lib/spark/build/mvn ']'
>
> ++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + VERSION=1.5.2
>
> ++ /usr/lib/spark/build/mvn help:evaluate
> -Dexpression=scala.binary.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + SCALA_VERSION=2.10
>
> ++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=hadoop.version
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ tail -n 1
>
> + SPARK_HADOOP_VERSION=2.6.0
>
> ++ /usr/lib/spark/build/mvn help:evaluate
> -Dexpression=project.activeProfiles -pl sql/hive
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> ++ grep -v INFO
>
> ++ fgrep --count 'hive'
>
> ++ echo -n
>
> + SPARK_HIVE=0
>
> + '[' hadoop2-without-hive == none ']'
>
> + echo 'Spark version is 1.5.2'
>
> Spark version is 1.5.2
>
> + '[' true == true ']'
>
> + echo 'Making spark-1.5.2-bin-hadoop2-without-hive.tgz'
>
> Making spark-1.5.2-bin-hadoop2-without-hive.tgz
>
> + '[' false == true ']'
>
> + echo 'Tachyon Disabled'
>
> Tachyon Disabled
>
> + cd /usr/lib/spark
>
> + export 'MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M
> -XX:ReservedCodeCacheSize=512m'
>
> + MAVEN_OPTS='-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m'
>
> + BUILD_COMMAND=("$MVN" clean package -DskipTests $@)
>
> + echo -e '\nBuilding with...'
>
>
>
> Building with...
>
> + echo -e '$ /usr/lib/spark/build/mvn' clean package -DskipTests
> '-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided\n'
>
> $ /usr/lib/spark/build/mvn clean package -DskipTests
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
>
>
> + /usr/lib/spark/build/mvn clean package -DskipTests
> -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided
>
> Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn
>
> [INFO] Scanning for projects...
>
> [INFO]
> 
>
> [INFO] Reactor Build Order:
>
> [INFO]
>
> [INFO] Spark Project Parent POM
>
> [INFO] Spark Project Launcher
>
> [INFO] Spark Project Networking
>
> [INFO] Spark Project Shuffle Streaming Service
>
> [INFO] Spark Project Unsafe
>
> [INFO] Spark Project Core
>
> [INFO] Spark Project Bagel
>
> [INFO] Spark Project GraphX
>
> [INFO] Spark Project Streaming
>
> [INFO] Spark Project Catalyst
>
> [INFO] Spark Project SQL
>
> [INFO] Spark Project ML Library
>
> [INFO] Spark Project Tools
>
> [INFO] Spark Project Hive
>
> [INFO] Spark Project REPL
>
> [INFO] Spark Project YARN
>
> [INFO] Spark Project Assembly
>
> [INFO] Spark Project External Twitter
>
> [INFO] Spark Project External Flume Sink
>
> [INFO] Spark Project External Flume
>
> [INFO] Spark Project External Flume Assembly
>
> [INFO] Spark Project External MQTT
>
> [INFO] Spark Project External MQTT Assembly
>
> [INFO] Spark Project External ZeroMQ
>
> [INFO] Spark Project External Kafka
>
> [INFO] Spark Project Examples
>
> [INFO] Spark Project External Kafka Assembly
>
> [INFO] Spark Project YARN Shuffle Service
>
> [INFO]
>
> [INFO]
> 
>
> [INFO] Building Spark Project Parent POM 1.5.2
>
> [INFO]
> 
>
> [INFO]
>
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
> spark-parent_2.10 ---
>
> [INFO] Deleting /usr/lib/spark/target
>
> [INFO]
>
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
> spark-parent_2.10 ---
>
> [INFO]
>
> [INFO] --- 

Re: Queue in Spark standalone mode

2015-11-25 Thread Prem Sure
In Spark standalone mode, submitted applications run in FIFO
(first-in-first-out) order. Please elaborate on the "strange behavior while
running multiple jobs simultaneously."
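
There is no YARN-style queue system in standalone mode, but a rough sketch of
the two settings that usually matter when several applications share a
standalone cluster (values and the app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("job-a")
  .set("spark.cores.max", "4")          // cap this app's cores so other apps can run concurrently
  .set("spark.scheduler.mode", "FAIR")  // fair scheduling of jobs *within* this application
val sc = new SparkContext(conf)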

On Wed, Nov 25, 2015 at 2:29 PM, sunil m <260885smanik...@gmail.com> wrote:

> Hi!
>
> I am using Spark 1.5.1 and pretty new to Spark...
>
> Like Yarn, is there a way to configure queues in Spark standalone mode?
> If yes, can someone point me to a good documentation / reference.
>
> Sometimes  I get strange behavior while running  multiple jobs
> simultaneously.
>
> Thanks in advance.
>
> Warm regards,
> Sunil
>


Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread swetha kasireddy
I am killing my Streaming job using UI. What error code does UI provide if
the job is killed from there?

On Wed, Nov 25, 2015 at 11:01 AM, Kay-Uwe Moosheimer 
wrote:

> Testet with Spark 1.5.2 … Works perfect when exit code is non-zero.
> And does not Restart with exit code equals zero.
>
>
> From: Prem Sure 
> Date: Wednesday, 25 November 2015 19:57
> To: SRK 
> Cc: 
> Subject: Re: Automatic driver restart does not seem to be working in
> Spark Standalone
>
> I think automatic driver restart will happen, if driver fails with
> non-zero exit code.
>
>   --deploy-mode cluster
>   --supervise
>
>
>
> On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:
>
>> Hi,
>>
>> I am submitting my Spark job with supervise option as shown below. When I
>> kill the driver and the app from UI, the driver does not restart
>> automatically. This is in a cluster mode.  Any suggestion on how to make
>> Automatic Driver Restart work would be of great help.
>>
>> --supervise
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Hi,

 

I am trying to build Spark from source without Hive. I am getting

 

[error] Required file not found: scala-compiler-2.10.4.jar

[error] See zinc -help for information about locating necessary files

 

I have to run this as root, otherwise the build does not progress. Any help is
appreciated.

 

 

-bash-3.2#  ./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"

+++ dirname ./make-distribution.sh

++ cd .

++ pwd

+ SPARK_HOME=/usr/lib/spark

+ DISTDIR=/usr/lib/spark/dist

+ SPARK_TACHYON=false

+ TACHYON_VERSION=0.7.1

+ TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz

+
TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.7.1/tachy
on-0.7.1-bin.tar.gz

+ MAKE_TGZ=false

+ NAME=none

+ MVN=/usr/lib/spark/build/mvn

+ ((  4  ))

+ case $1 in

+ NAME=hadoop2-without-hive

+ shift

+ shift

+ ((  2  ))

+ case $1 in

+ MAKE_TGZ=true

+ shift

+ ((  1  ))

+ case $1 in

+ break

+ '[' -z /usr/java/latest ']'

+ '[' -z /usr/java/latest ']'

++ command -v git

+ '[' ']'

++ command -v /usr/lib/spark/build/mvn

+ '[' '!' /usr/lib/spark/build/mvn ']'

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=project.version
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ VERSION=1.5.2

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=scala.binary.version
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ SCALA_VERSION=2.10

++ /usr/lib/spark/build/mvn help:evaluate -Dexpression=hadoop.version
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ tail -n 1

+ SPARK_HADOOP_VERSION=2.6.0

++ /usr/lib/spark/build/mvn help:evaluate
-Dexpression=project.activeProfiles -pl sql/hive
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

++ grep -v INFO

++ fgrep --count 'hive'

++ echo -n

+ SPARK_HIVE=0

+ '[' hadoop2-without-hive == none ']'

+ echo 'Spark version is 1.5.2'

Spark version is 1.5.2

+ '[' true == true ']'

+ echo 'Making spark-1.5.2-bin-hadoop2-without-hive.tgz'

Making spark-1.5.2-bin-hadoop2-without-hive.tgz

+ '[' false == true ']'

+ echo 'Tachyon Disabled'

Tachyon Disabled

+ cd /usr/lib/spark

+ export 'MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M
-XX:ReservedCodeCacheSize=512m'

+ MAVEN_OPTS='-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m'

+ BUILD_COMMAND=("$MVN" clean package -DskipTests $@)

+ echo -e '\nBuilding with...'

 

Building with...

+ echo -e '$ /usr/lib/spark/build/mvn' clean package -DskipTests
'-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided\n'

$ /usr/lib/spark/build/mvn clean package -DskipTests
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

 

+ /usr/lib/spark/build/mvn clean package -DskipTests
-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided

Using `mvn` from path: /usr/lib/spark/build/apache-maven-3.3.3/bin/mvn

[INFO] Scanning for projects...

[INFO]


[INFO] Reactor Build Order:

[INFO]

[INFO] Spark Project Parent POM

[INFO] Spark Project Launcher

[INFO] Spark Project Networking

[INFO] Spark Project Shuffle Streaming Service

[INFO] Spark Project Unsafe

[INFO] Spark Project Core

[INFO] Spark Project Bagel

[INFO] Spark Project GraphX

[INFO] Spark Project Streaming

[INFO] Spark Project Catalyst

[INFO] Spark Project SQL

[INFO] Spark Project ML Library

[INFO] Spark Project Tools

[INFO] Spark Project Hive

[INFO] Spark Project REPL

[INFO] Spark Project YARN

[INFO] Spark Project Assembly

[INFO] Spark Project External Twitter

[INFO] Spark Project External Flume Sink

[INFO] Spark Project External Flume

[INFO] Spark Project External Flume Assembly

[INFO] Spark Project External MQTT

[INFO] Spark Project External MQTT Assembly

[INFO] Spark Project External ZeroMQ

[INFO] Spark Project External Kafka

[INFO] Spark Project Examples

[INFO] Spark Project External Kafka Assembly

[INFO] Spark Project YARN Shuffle Service

[INFO]

[INFO]


[INFO] Building Spark Project Parent POM 1.5.2

[INFO]


[INFO]

[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
spark-parent_2.10 ---

[INFO] Deleting /usr/lib/spark/target

[INFO]

[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
spark-parent_2.10 ---

[INFO]

[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
spark-parent_2.10 ---

[INFO] Add Source directory: /usr/lib/spark/src/main/scala

[INFO] Add Test Source directory: /usr/lib/spark/src/test/scala

[INFO]

[INFO] --- maven-remote-resources-plugin:1.5:process (default) @
spark-parent_2.10 ---

[INFO]

[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-parent_2.10 ---

[INFO] No sources to compile

[INFO]

[INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ spark-parent_2.10
---

[INFO] Executing tasks


Spark- Cassandra Connector Error

2015-11-25 Thread ahlusar
Hello,

I receive the following error when I attempt to connect to a Cassandra
keyspace and table: WARN NettyUtil: 

"Found Netty's native epoll transport, but not running on linux-based
operating system. Using NIO instead"

The full details and log can be viewed here:
http://stackoverflow.com/questions/33896937/spark-connector-error-warn-nettyutil-found-nettys-native-epoll-transport-but


Thank you for your help and for your support. 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-Connector-Error-tp25483.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to gracefully shutdown a Spark Streaming Job?

2015-11-25 Thread Ted Yu
You can utilize this method from StreamingContext :
  def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
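
A minimal sketch (ssc here is assumed to be the running StreamingContext; how
you trigger the call, e.g. a marker file or an external signal, is up to you):

// Let in-flight batches finish, then stop the SparkContext as well.
ssc.stop(stopSparkContext = true, stopGracefully = true)

// Since 1.4 you can also set spark.streaming.stopGracefullyOnShutdown=true so
// that a normal JVM shutdown (e.g. SIGTERM) stops the streaming job gracefully.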

Cheers

On Wed, Nov 25, 2015 at 1:16 PM, SRK  wrote:

> Hi,
>
> How to gracefully shutdown a Spark Streaming Job? Currently I just kill it
> from my Spark StandAlone cluster/ master UI. But, in that case automatic
> Driver Restarts do not seem to be kicking off and there are a few options
> listed here. What is the correct way of doing this?
>
>
>
> http://stackoverflow.com/questions/32582730/how-do-i-stop-a-spark-streaming-job
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-gracefully-shutdown-a-Spark-Streaming-Job-tp25482.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: UDF with 2 arguments

2015-11-25 Thread Davies Liu
It works in master (1.6). Which version of Spark do you have?

>>> from pyspark.sql.functions import udf
>>> def f(a, b): pass
...
>>> my_udf = udf(f)
>>> from pyspark.sql.types import *
>>> my_udf = udf(f, IntegerType())


On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes  wrote:
> Hallo,
>
> supose I have function in pyspark that
>
> def function(arg1,arg2):
>   pass
>
> and
>
> udf_function = udf(function, IntegerType())
>
> that takes me error
>
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: __init__() takes at least 2 arguments (1 given)
>
>
> How I use?
>
> Best,
>
>
> --
> Daniel Lopes, B.Eng
> Data Scientist - BankFacil
> CREA/SP 5069410560
> Mob +55 (18) 99764-2733
> Ph +55 (11) 3522-8009
> http://about.me/dannyeuu
>
> Av. Nova Independência, 956, São Paulo, SP
> Bairro Brooklin Paulista
> CEP 04570-001
> https://www.bankfacil.com.br
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark : merging object with approximation

2015-11-25 Thread Karl Higley
Hi,

What merge behavior do you want when A~=B, B~=C but A!=C? Should the merge
emit ABC? AB and BC? Something else?

Best,
Karl

On Sat, Nov 21, 2015 at 5:24 AM OcterA  wrote:

> Hello,
>
> I have a set of X data (around 30M entry), I have to do a batch to merge
> data which are similar, at the end I will have around X/2 data.
>
> At this moment, i've done the basis : open files, map to an usable Ojbect,
> but I'm stuck at the merge part...
>
> The merge condition is composed from various conditions
>
> A.get*Start*Point == B.get*End*Point
> Difference between A.getStartDate and B.getStartDate is less than X1
> second
> Difference between A.getEndDate and B.getEndDate is less than X2 second
> A.getField1 startWith B.getField1
> some more like that...
>
> Suddenly, I can have A~=B, B~=C but A!=C. For my Spark comprehension, this
> is a problem, because I can have an hash to reduce greatly the scan time...
>
> Have you some advice, to resolve my problem, or pointers on method which
> can
> help me? Maybe an another tools from the Hadoop ecosystem?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-merging-object-with-approximation-tp25445.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark REST Job server feedback?

2015-11-25 Thread Ruslan Dautkhanov
Very good question. From
http://gethue.com/new-notebook-application-for-spark-sql/

"Use Livy Spark Job Server from the Hue master repository instead of CDH
(it is currently much more advanced): see build & start the latest Livy

"

Although that post is from April 2015, not sure if it's still accurate.




-- 
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar  wrote:

> Hi
>
> I had the same question. Anyone having used Livy and/opr SparkJobServer,
> would like their input.
>
> Regards
> Deenar
>
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
>
>
> On 8 October 2015 at 20:39, Tim Smith  wrote:
>
>> I am curious too - any comparison between the two. Looks like one is
>> Datastax sponsored and the other is Cloudera. Other than that, any
>> major/core differences in design/approach?
>>
>> Thanks,
>>
>> Tim
>>
>>
>> On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal <
>> ramirezquetza...@gmail.com> wrote:
>>
>>> Anyone has feedback on using Hue / Spark Job Server REST servers?
>>>
>>>
>>> http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/
>>>
>>> https://github.com/spark-jobserver/spark-jobserver
>>>
>>> Many thanks,
>>>
>>> Rami
>>>
>>
>>
>>
>> --
>>
>> --
>> Thanks,
>>
>> Tim
>>
>
>


Re: Data in one partition after reduceByKey

2015-11-25 Thread Ruslan Dautkhanov
public long getTime()

Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT
represented by this Date object.

http://docs.oracle.com/javase/7/docs/api/java/util/Date.html#getTime%28%29

Based on what you did, it might be easier to build a date partitioner from that.
Also, to get an even more even distribution, you could use a hash function on
that value, not just a remainder.
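
To make that concrete, a rough sketch for the Joda DateTime keys used in this
thread (class name and partition count are illustrative):

import org.apache.spark.Partitioner
import org.joda.time.DateTime

class MillisPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case dt: DateTime =>
      val h = dt.getMillis.##                          // hash the epoch millis, not the DateTime
      ((h % numPartitions) + numPartitions) % numPartitions
    case _ => 0
  }
}

// e.g. mapped.reduceByKey(new MillisPartitioner(16), _ + _)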




-- 
Ruslan Dautkhanov

On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin 
wrote:

> I will answer my own question, since I figured it out.  Here is my answer
> in case anyone else has the same issue.
>
> My DateTimes were all without seconds and milliseconds since I wanted to
> group data belonging to the same minute. The hashCode() for Joda DateTimes
> which are one minute apart is a constant:
>
> scala> val now = DateTime.now
> now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
>
> scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - 
> now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
> res42: Int = 6
>
> As can be seen by this example, if the hashCode values are similarly
> spaced, they can end up in the same partition:
>
> scala> val nums = for(i <- 0 to 100) yield ((i*20 % 1000), i)
> nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), 
> (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), 
> (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), 
> (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), 
> (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), 
> (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), 
> (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), 
> (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), 
> (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), 
> (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), 
> (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
>
> scala> val rddNum = sc.parallelize(nums)
> rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at 
> parallelize at :23
>
> scala> val reducedNum = rddNum.reduceByKey(_+_)
> reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at 
> reduceByKey at :25
>
> scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, 
> true).collect.toList
>
> res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0)
>
> To distribute my data more evenly across the partitions I created my own
> custom Partitoiner:
>
> class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
>   def numPartitions: Int = rddNumPartitions
>   def getPartition(key: Any): Int = {
> key match {
>   case dateTime: DateTime =>
> val sum = dateTime.getYear + dateTime.getMonthOfYear +  
> dateTime.getDayOfMonth + dateTime.getMinuteOfDay  + dateTime.getSecondOfDay
> sum % numPartitions
>   case _ => 0
> }
>   }
> }
>
>
> On 20 November 2015 at 17:17, Patrick McGloin 
> wrote:
>
>> Hi,
>>
>> I have Spark application which contains the following segment:
>>
>> val reparitioned = rdd.repartition(16)
>> val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, 
>> endDate)
>> val mapped: RDD[(DateTime, myData)] = filtered.map(kv=(kv._1.processingTime, 
>> kv._2))
>> val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)
>>
>> When I run this with some logging this is what I see:
>>
>> reparitioned ==> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 
>> 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
>> filtered ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 
>> 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
>> mapped ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 
>> 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
>> reduced ==> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
>>
>> My logging is done using these two lines:
>>
>> val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, 
>> true)log.info(s"rdd ==> [${sizes.collect.toList}]")
>>
>> My question is why does my data end up in one partition after the
>> reduceByKey? After the filter it can be seen that the data is evenly
>> distributed, but the reduceByKey results in data in only one partition.
>>
>> Thanks,
>>
>> Patrick
>>
>
>


Re: Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
LD_LIBRARY_PATH points to /opt/eTOX_spark/lib/, the location of the native
libraries.


Oriol.

2015-11-25 17:39 GMT+01:00 Ted Yu :

> In your spark-env, did you set LD_LIBRARY_PATH ?
>
> Cheers
>
> On Wed, Nov 25, 2015 at 7:32 AM, Oriol López Massaguer <
> oriol.lo...@gmail.com> wrote:
>
>> Hello;
>>
>> I'm trying to use a native library in Spark.
>>
>> I was using a simple standalone cluster with one master and worker.
>>
>> According to the documentation I edited the spark-defautls.conf by
>> setting:
>>
>> spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
>> spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
>> spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/
>>
>> In the path /opt/eTOX_spark/lib/ there are 3 so files wich are wrapped in
>> org.RDKit.jar.
>>
>> But when I try so submit a job that uses the native library I get:
>>
>> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>> org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Ljava/lang/String;)J
>> at org.RDKit.RDKFuncsJNI.RWMol_MolFromSmiles__SWIG_3(Native Method)
>> at org.RDKit.RWMol.MolFromSmiles(RWMol.java:426)
>> at models.spark.sources.eTOX_DB$.main(eTOX.scala:54)
>> at models.spark.sources.eTOX_DB.main(eTOX.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> I use the submit.sh with the following parameters:
>>
>>  /opt/spark/bin/spark-submit --verbose --class
>> "models.spark.sources.eTOX_DB"  --master
>> spark://localhost.localdomain:7077
>> target/scala-2.10/etox_spark_2.10-1.0.jar
>>
>> the full output is:
>>
>> Using properties file: /opt/spark/conf/spark-defaults.conf
>> Adding default property:
>> spark.driver.extraLibraryPath=/opt/eTOX_spark/lib/
>> Adding default property:
>> spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
>> Adding default property:
>> spark.executor.extraLibraryPath=/opt/eTOX_spark/lib/
>> Parsed arguments:
>>   master  spark://localhost.localdomain:7077
>>   deployMode  null
>>   executorMemory  null
>>   executorCores   null
>>   totalExecutorCores  null
>>   propertiesFile  /opt/spark/conf/spark-defaults.conf
>>   driverMemorynull
>>   driverCores null
>>   driverExtraClassPath/opt/eTOX_spark/lib/org.RDKit.jar
>>   driverExtraLibraryPath  /opt/eTOX_spark/lib/
>>   driverExtraJavaOptions  null
>>   supervise   false
>>   queue   null
>>   numExecutorsnull
>>   files   null
>>   pyFiles null
>>   archivesnull
>>   mainClass   models.spark.sources.eTOX_DB
>>   primaryResource
>> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
>>   namemodels.spark.sources.eTOX_DB
>>   childArgs   []
>>   jarsnull
>>   packagesnull
>>   packagesExclusions  null
>>   repositoriesnull
>>   verbose true
>>
>> Spark properties used, including those specified through
>>  --conf and those from the properties file
>> /opt/spark/conf/spark-defaults.conf:
>>   spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
>>   spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
>>   spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar
>>
>>
>> Main class:
>> models.spark.sources.eTOX_DB
>> Arguments:
>>
>> System properties:
>> spark.executor.extraLibraryPath -> /opt/eTOX_spark/lib/
>> spark.driver.extraLibraryPath -> /opt/eTOX_spark/lib/
>> SPARK_SUBMIT -> true
>> spark.app.name -> models.spark.sources.eTOX_DB
>> spark.jars ->
>> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
>> spark.submit.deployMode -> client
>> spark.master -> spark://localhost.localdomain:7077
>> spark.driver.extraClassPath -> /opt/eTOX_spark/lib/org.RDKit.jar
>> Classpath elements:
>> file:/opt/eTOX_spark/target/scala-2.10/etox_spark_2.10-1.0.jar
>>
>>
>> Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so)
>> Loading libraries
>> Buffer(/opt/jdk1.8.0_45/jre/lib/amd64/libzip.so, /opt/eTOX_spark/lib/
>> libboost_thread.1.48.0.so, /opt/eTOX_spark/lib/libboost_system.1.48.0.so,
>> /opt/eTOX_spark/lib/libGraphMolWrap.so)
>> Loading libraries
>> Using Spark's default log4j profile:
>> 

Fwd: pyspark: Error when training a GMM with an initial GaussianMixtureModel

2015-11-25 Thread Guillaume Maze
Hi all,
We're trying to train a Gaussian Mixture Model (GMM) with a specified
initial model.
The 1.5.1 docs say we should use a GaussianMixtureModel object as input
for the "initialModel" parameter of the GaussianMixture.train method.
Before creating our own initial model (the plan is to use a K-means
result, for instance), we simply wanted to test this scenario.
So we try to initialize a 2nd training using the GaussianMixtureModel
from the output of a 1st training.
But this trivial scenario throws an error.
Could you please help us determine what's going on here?
Thanks a lot
guillaume

PS: we run (py)spark 1.5.1 with hadoop 2.6

Below is the trivial scenario code and the error:

 SOURCE CODE
from pyspark.mllib.clustering import GaussianMixture
from numpy import array
import sys
import os
import pyspark

### Local default options
K=2 # "k" (int) Set the number of Gaussians in the mixture model.  Default:
2
convergenceTol=1e-3 # "convergenceTol" (double) Set the largest change in
log-likelihood at which convergence is considered to have occurred.
maxIterations=100 # "maxIterations" (int) Set the maximum number
of iterations to run. Default: 100
seed=None # "seed" (long) Set the random seed
initialModel=None

### Load and parse the sample data
data = sc.textFile("gmm_data.txt") # Data from the dummy set
here: data/mllib/gmm_data.txt
parsedData = data.map(lambda line: array([float(x) for x
in line.strip().split(' ')]))
print type(parsedData)
print type(parsedData.first())

### 1st training: Build the GMM
gmm = GaussianMixture.train(parsedData, K, convergenceTol, maxIterations,
seed, initialModel)

# output parameters of model
for i in range(2):
print ("weight = ", gmm.weights[i], "mu = ", gmm.gaussians[i].mu,
"sigma = ", gmm.gaussians[i].sigma.toArray())

### 2nd training: Re-build a GMM using an initial model
initialModel = gmm
print type(initialModel)
gmm = GaussianMixture.train(parsedData, K, convergenceTol, maxIterations,
seed, initialModel)


 OUTPUT WITH ERROR:


('weight = ', 0.51945003367044018, 'mu = ', DenseVector([-0.1045,
0.0429]), 'sigma = ', array([[ 4.90706817, -2.00676881],
   [-2.00676881,  1.01143891]]))
('weight = ', 0.48054996632955982, 'mu = ', DenseVector([0.0722,
0.0167]), 'sigma = ', array([[ 4.77975653,  1.87624558],
   [ 1.87624558,  0.91467242]]))


---
Py4JJavaError Traceback (most recent call last)
 in ()
 33 initialModel = gmm
 34 print type(initialModel)
---> 35 gmm = GaussianMixture.train(parsedData, K, convergenceTol,
maxIterations, seed, initialModel) #

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/clustering.pyc
in train(cls, rdd, k, convergenceTol, maxIterations, seed,
initialModel)
306 java_model =
callMLlibFunc("trainGaussianMixtureModel",
rdd.map(_convert_to_vector),
307k, convergenceTol,
maxIterations, seed,
--> 308initialModelWeights,
initialModelMu, initialModelSigma)
309 return GaussianMixtureModel(java_model)
310

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callMLlibFunc(name, *args)
128 sc = SparkContext._active_spark_context
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callJavaFunc(sc, func, *args)
120 def callJavaFunc(sc, func, *args):
121 """ Call Java Function """
--> 122 args = [_py2java(sc, a) for a in args]
123 return _java2py(sc, func(*args))
124

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in _py2java(sc, obj)
 86 else:
 87 data = bytearray(PickleSerializer().dumps(obj))
---> 88 obj = sc._jvm.SerDe.loads(data)
 89 return obj
 90

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in
deco(*a, **kw)
 34 def deco(*a, **kw):
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:
 38 s = e.java_exception.toString()

/opt/spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 

Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-25 Thread filthysocks
jmvllt wrote
> Here, because the predicted class will always be 0 or 1, there is no way
> to vary the threshold to get the aucROC, right  Or am I totally wrong
> ? 

No, you are right. If you pass a (Score,Label) tuple to
BinaryClassificationMetrics, then Score has to be the class probability. 

Have you seen the clearThreshold function?

spark_docu wrote
> Clears the threshold so that predict will output raw prediction scores.

https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

You probably need to call it before the predict call.
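
A minimal sketch of that flow (model and test are assumed to be a trained
LogisticRegressionModel and an RDD[LabeledPoint]):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

model.clearThreshold()   // predict() now returns raw scores instead of 0/1 labels

val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"areaUnderROC = ${metrics.areaUnderROC()}, areaUnderPR = ${metrics.areaUnderPR()}")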






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-relevant-to-use-BinaryClassificationMetrics-aucROC-aucPR-with-LogisticRegressionModel-tp25465p25473.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread diplomatic Guru
Hello,

I know how I could clear the old state depending on the input value. If
some condition matches to determine that the state is old then set the
return null, will invalidate the record. But this is only feasible if a new
record arrives that matches the old key. What if no new data arrives for
the old data, how could I make that invalid.

e.g.

A key/Value arrives like this

Key 12-11-2015:10:00: Value:test,1,2,12-11-2015:10:00

Above key will be updated to state.

Every time there is a value for this '12-11-2015:10:00' key, it will be
aggregated and updated. If the job is running for 24/7, then this state
will be kept forever until we restart the job. But I could have a
validation within the updateStateByKey function to check and delete the
record if value[3]< SYSTIME-1. But this only effective if a new record
arrives that matches the 12-11-2015:10:00 in the later days. What if no new
values are received for this key:12-11-2015:10:00. I assume it will remain
in the state, am I correct? if so the how do I clear the state?

Thank you.


Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-25 Thread jmvllt
Hi filthysocks,

Thanks for the answer. Indeed, using the clearThreshold() function solved my
problem :).

Regards,
Jean.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-relevant-to-use-BinaryClassificationMetrics-aucROC-aucPR-with-LogisticRegressionModel-tp25465p25475.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Todd Nist
Perhaps the new trackStateByKey targeted for very 1.6 may help you here.
I'm not sure if it is part of 1.6 or not for sure as the jira does not
specify a fixed version.  The jira describing it is here:
https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that
discusses the API changes is here:

https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#

Look for the timeout  function:

/**

  * Set the duration of inactivity (i.e. no new data) after which a state

  * can be terminated by the system. After this idle period, the system

  * will mark the idle state as being timed out, and call the tracking

  * function with State[S].isTimingOut() = true.

  */

 def timeout(duration: Duration): this.type
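
Purely as a sketch of how that might be used if the API lands as proposed in
the design doc (the trackStateByKey/StateSpec names below come from the doc
and branch-1.6 and may still change; wordCounts is an assumed
DStream[(String, Int)]):

import org.apache.spark.streaming._

def trackingFunc(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  if (state.isTimingOut()) {
    None                                        // idle key: its state is being dropped
  } else {
    val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
    state.update(sum)
    Some((key, sum))
  }
}

val spec = StateSpec.function(trackingFunc _).timeout(Minutes(60))
val tracked = wordCounts.trackStateByKey(spec)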

-Todd

On Wed, Nov 25, 2015 at 8:00 AM, diplomatic Guru 
wrote:

> Hello,
>
> I know how I could clear the old state depending on the input value. If
> some condition matches to determine that the state is old then set the
> return null, will invalidate the record. But this is only feasible if a new
> record arrives that matches the old key. What if no new data arrives for
> the old data, how could I make that invalid.
>
> e.g.
>
> A key/Value arrives like this
>
> Key 12-11-2015:10:00: Value:test,1,2,12-11-2015:10:00
>
> Above key will be updated to state.
>
> Every time there is a value for this '12-11-2015:10:00' key, it will be
> aggregated and updated. If the job is running for 24/7, then this state
> will be kept forever until we restart the job. But I could have a
> validation within the updateStateByKey function to check and delete the
> record if value[3]< SYSTIME-1. But this only effective if a new record
> arrives that matches the 12-11-2015:10:00 in the later days. What if no new
> values are received for this key:12-11-2015:10:00. I assume it will remain
> in the state, am I correct? if so the how do I clear the state?
>
> Thank you.
>
>
>


Spark, Windows 7 python shell non-reachable ip address

2015-11-25 Thread Shuo Wang
After running these two lines in the Quick Start example in spark's python
shell on windows 7.

>>> textFile = sc.textFile("README.md")

>>> textFile.count()

I am getting the following error:

>>> textFile.count()

15/11/25 19:57:01 WARN : Your hostname, oh_t-PC resolves to a
 loopback/non-reachable address: fe80:0:0:0:84b:213f:3f57:fef6%net5,
but we couldn't find  any external IP address!
  Traceback (most recent call last):
  File "", line 1, in 
  File "C:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line
1006, in count
 

Any idea what is going wrong here?

-- 
王硕
Email: shuo.x.w...@gmail.com
Whatever your journey, keep walking.


Spark Driver Port Details

2015-11-25 Thread aman solanki
Hi,

Can anyone tell me how I can find out which port a particular Spark
application's UI is running on?

For Example:

I have two applications A and B

A is running on 4040
B is running on 4041

How can I get this application-to-port mapping? Is there a REST call or an
environment variable for it?

Please share your findings for standalone mode.

Thanks,
Aman Solanki


Re: Spark Driver Port Details

2015-11-25 Thread Todd Nist
The default is to start the first application's UI on port 4040 and then
increment by 1 for each additional application, as you are seeing; see the docs here:
http://spark.apache.org/docs/latest/monitoring.html#web-interfaces

You can override this behavior by passing --conf spark.ui.port=4080 or by
setting it in your code; something like this:

val conf = new SparkConf().setAppName(s"YourApp").set("spark.ui.port", "4080")
val sc = new SparkContext(conf)

While there is a REST API that returns information on the application,
http://yourserver:8080/api/v1/applications, it does not return the port
used by the application.

-Todd

On Wed, Nov 25, 2015 at 9:15 AM, aman solanki 
wrote:

>
> Hi,
>
> Can anyone tell me how i can get the details that a particular spark
> application is running on which particular port?
>
> For Example:
>
> I have two applications A and B
>
> A is running on 4040
> B is running on 4041
>
> How can i get these application port mapping? Is there a rest call or
> environment variable for the same?
>
> Please share your findings for standalone mode.
>
> Thanks,
> Aman Solanki
>


Spark, Windows 7 python shell non-reachable ip address

2015-11-25 Thread Shuo Wang
Hi,

After running these two lines in the Quick Start example in spark's python
shell on windows 7.

>>> textFile = sc.textFile("README.md")

>>> textFile.count()

I am getting the following error:

>>> textFile.count()

15/11/25 19:57:01 WARN : Your hostname, oh_t-PC resolves to a
 loopback/non-reachable address: fe80:0:0:0:84b:213f:3f57:fef6%net5,
but we couldn't find  any external IP address!
  Traceback (most recent call last):
  File "", line 1, in 
  File "C:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line
1006, in count
 

Any idea what is going wrong here?

-- 
王硕
Email: shuo.x.w...@gmail.com
Whatever your journey, keep walking.


log4j custom appender ClassNotFoundException with spark 1.5.2

2015-11-25 Thread lev
Hi,

I'm using spark 1.5.2 and running on a yarn cluster
and trying to use a custom log4j appender 

in my setup there are 3 jars:
the uber jar: spark.yarn.jar=uber-jar.jar
the jar that contains the main class: main.jar
additional jar with dependencies: dep.jar (passed with the --jars flag to
spark submit)

I've tried defining my appender in each one of the jars:
in uber-jar: the appender is found and created successfully
in main.jar or dep.jar: throws ClassNotFoundException 

I guess log4j tries to load the class before the extra jars have been loaded.

it's related to this ticket:
https://issues.apache.org/jira/browse/SPARK-9826
but it's not the same, as the ticket talks about the appender being defined in
the uber-jar, and that case works for me.

any ideas on how to solve this?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-5-2-tp25487.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JavaPairRDD.treeAggregate

2015-11-25 Thread amit tewari
Hi, does someone have experience/knowledge of using
JavaPairRDD.treeAggregate?
Even sample code would be helpful.
Not many articles etc. are available on the web.

Thanks
Amit
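
For what it's worth, a small sketch of the idea on the Scala API (the
JavaPairRDD version mirrors it, taking Function2 objects for the two merge
functions; the data below is made up):

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 10, i.toDouble))

// Sum all values, merging partial results in a 2-level tree so the driver
// does not have to combine every partition's result by itself.
val total = pairs.treeAggregate(0.0)(
  seqOp  = (acc, kv) => acc + kv._2,   // fold one record into the partition-local accumulator
  combOp = (a, b) => a + b,            // merge accumulators from different partitions
  depth  = 2)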